Gemma-2B: Beyond the Basics
In the ever-evolving landscape of AI, one name has emerged to redefine the boundaries of what’s possible: Gemma. In this blog post, we will delve into the intricacies of Gemma-2B, uncovering its key features, use cases, and best practices. 🌟
What is Gemma?
The Latin root of the Italian feminine name Gemma means “gem” or “precious stone”. Throughout the 1980s, the name was among the most popular in Scotland and England. It conveys the image of something priceless and exquisite, similar to a pearl.
What is Google’s Gemma?
Gemma is a family of lightweight, state-of-the-art open models developed by Google DeepMind and other teams across Google. It is inspired by the larger Gemini models and built from the same research and technology.
Four new LLM variants are available, in two sizes: 2B and 7B parameters. Each size comes in a base (pre-trained) and an instruction-tuned version. These are text-to-text, decoder-only large language models, available in English, with open weights. They work well for many text-generation tasks, such as summarization, reasoning, and question answering. Thanks to their relatively modest size, they can be deployed in environments with constrained resources, such as a desktop, a laptop, or your preferred cloud infrastructure. This democratizes access to cutting-edge AI models and encourages creativity for all.
These models are intended to help academics and developers create AI responsibly. The Gemma terms of use allow researchers, developers, and commercial users to access and redistribute the models freely, and to develop and share their own model variants. Developers who use Gemma models pledge not to use them for harmful purposes, reflecting Google’s commitment to developing AI ethically while broadening access to this technology.
Gemma 2B
The Gemma 2B model is intriguing because of its compact size. With 2 billion parameters it has a small footprint, making it a great option for those with memory constraints who want to prioritize efficiency. However, it does not perform as well as some other similarly sized models, such as Phi-2, on public leaderboards.
Model Architecture
The Gemma model architecture is built on the transformer decoder (Vaswani et al., 2017). Based on ablation studies showing that the respective attention variants improved performance at each scale, the 2B model uses multi-query attention (with num_kv_heads = 1).
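To make the idea concrete, here is a minimal, illustrative multi-query attention sketch in plain PyTorch. This is a toy example of my own, not Gemma’s actual implementation, and it omits details such as rotary embeddings and KV caching; the point is simply that every query head attends against one shared key/value head.

#Toy multi-query attention: all query heads share a single key/value head (num_kv_heads = 1)
import torch
import torch.nn.functional as F
from torch import nn

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model=2048, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)             # one projection per query head
        self.kv_proj = nn.Linear(d_model, 2 * self.head_dim)  # a single shared key/value head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                     # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)
        #Broadcast the single key/value head across all query heads
        attn = F.scaled_dot_product_attention(q, k.unsqueeze(1), v.unsqueeze(1), is_causal=True)
        return self.out_proj(attn.transpose(1, 2).reshape(b, t, -1))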
Memorization
Google measured memorization rates across different model families and found that the Gemma pre-trained models have low memorization rates, similar to PaLM and PaLM 2 models of comparable size.
Key areas where Gemma 2B excels:
Summarization tasks
The capacity to distill knowledge into concise synopses is valuable in many fields. Gemma 2B can automatically produce succinct research-paper abstracts, condense news stories for quick reading, and even summarize meeting transcripts with the most important points highlighted. For professionals, researchers, and students alike, this saves time and boosts productivity. These are only a handful of the opportunities Gemma 2B opens up: its solid performance and lightweight design bring AI capabilities to situations where larger, more computationally intensive models would not be practical.
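As a quick illustration, here is a minimal summarization sketch using the same Hugging Face transformers API as the walkthrough later in this post. The article text and the prompt wording are placeholders you would replace with your own.

#Illustrative summarization prompt for gemma-2b
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
article = "..."  # paste the text you want summarized here
prompt = f"Summarize the following article in three sentences:\n{article}\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))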
Conversational AI and Chatbots
Gemma 2B is excellent at creating interactions that are realistic and context-aware. This has the potential to transform customer-care chatbots by offering richer, more helpful interactions than simply answering frequently asked questions. Furthermore, Gemma 2B could power virtual companions for elderly people, conversing with them in a context-aware manner to ease loneliness and keep their minds engaged.
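For a flavor of how a single chat turn might look, here is a small sketch. It assumes the instruction-tuned checkpoint google/gemma-2b-it and the transformers chat-template API; the user message is just an example.

#Illustrative single chat turn with the instruction-tuned model
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
messages = [{"role": "user", "content": "Suggest a gentle daily exercise routine for seniors."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=150)
#Print only the newly generated reply, not the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))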
Mobile and edge devices
With Gemma 2B, AI-powered functionality can be used without a strong cloud connection. Imagine intelligent assistants for smart-home gadgets, offline text-summarization tools for students with limited internet access, or on-device language translation for travelers. Gemma 2B makes these scenarios feasible by delivering strong AI capabilities within the limits of resource-constrained hardware.
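One common way to squeeze the model into tighter memory budgets is quantization. The sketch below is only an example of that approach; it assumes a CUDA-capable GPU and the bitsandbytes and accelerate packages, and loads gemma-2b in 4-bit precision.

#Illustrative 4-bit load for resource-constrained setups
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", quantization_config=quant_config, device_map="auto")
inputs = tokenizer("Translate to French: Where is the train station?", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=40)[0], skip_special_tokens=True))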
Test and Play🎉
Google’s Gemma models work seamlessly with well-known deep learning frameworks: through native Keras 3.0 support, they can run with JAX, PyTorch, or TensorFlow as the backend.
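For example, a minimal KerasNLP sketch might look like the following. It assumes the keras-nlp package is installed and that you have accepted the Gemma license on Kaggle, since the preset weights are downloaded from there.

#Illustrative KerasNLP usage with a swappable backend
import os
os.environ["KERAS_BACKEND"] = "jax"  # or "torch" / "tensorflow"
import keras_nlp
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")
print(gemma_lm.generate("The best thing about open models is", max_length=64))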
Integrations:
Colab: Ready to use.
Hugging Face Transformers: Leverage popular libraries and pre-built pipelines.
Kaggle: Access Gemma models, community discussions, and examples.
MaxText: Enables scaling up to tens of thousands of Cloud TPU chips.
Nvidia NeMo: A generative AI framework for researchers and PyTorch developers working on large language models (LLMs), multimodal models (MM), text-to-speech synthesis (TTS), and automatic speech recognition (ASR).
TensorRT-LLM: Offers a user-friendly Python API to define Large Language Models (LLMs).
Optimizations:
Industry-leading performance is achieved through optimization across multiple AI hardware platforms, including NVIDIA GPUs and Google Cloud TPUs.
Understanding the sample Gemma-2B code
#Package installations
pip install accelerate
pip install -U transformers
#Imports necessary libraries for HuggingFace authentication
import os
from huggingface_hub import login
#Authentication
HUGGINGFACE_TOKEN = os.environ.get("HUGGINGFACE_TOKEN")
login(token=HUGGINGFACE_TOKEN)
from transformers import AutoTokenizer, AutoModelForCausalLM
#Loads the tokenizer and the language model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
#prompt
input_text = "I am Anushka and code"
#Tokenizes the input text using the loaded tokenizer
#And returns the tokenized input in PyTorch tensor format
input_ids = tokenizer(input_text, return_tensors="pt")
#Generates text based on the tokenized input
outputs = model.generate(**input_ids)
#Decodes the generated output tokens into human-readable text
print(tokenizer.decode(outputs[0]))
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded_output)
# Remove specific HTML-like tags from the decoded text
import re
cleaned_output = re.sub(r'<.*?>', '', decoded_output)
print(cleaned_output)
Output
The first line of the output looks like:
<bos> I am Anushka and code <strong> is my passion</strong>. I am a <strong> Full
The <bos> at the beginning of the output is the model’s “beginning of sequence” token, which marks the start of a generated sequence. If you don’t want it in your output, adjust the decoding step to skip special tokens, as the second print statement above does with skip_special_tokens=True.
As for the HTML-like tags, the model has generated content that includes them, most likely because of the training data it was exposed to. If you want to strip tags such as <strong> and <em> from the output, you can apply a text-cleaning step like the regular expression shown above.
Summary
Gemma is an openly accessible family of generative language models designed for both text and code generation. It has demonstrated impressive performance on established benchmarks compared to existing models. The aim is to steer it toward responsible and effective use across a wide range of applications.
Bonus resources for you!!🤩
- Google Gemma Blog
- Gemma 2b model
- Gemma on Kaggle
- Gemma 2B | NVIDIA NGC
- Gemma Technical Report
- Google AI Gemma Website
- Gemma Playground Hugging Face
- Gemma on Google Cloud | Vertex Model Garden
- Get started with Gemma using KerasNLP | Google AI for Developers
- Join the Gemma Community 🎉
If you enjoyed reading this blog, a round of applause (claps 👏) would be greatly appreciated! Additionally, don’t forget to hit that follow button for more content like this in the future. Thank you! 💖