Gemma 3n Description
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. Gemma 3n models are designed for efficient execution on low-resource devices.
They accept multimodal input, handling text, images, video, and audio, and generate text outputs, with open weights for both pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.
Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B or 4B parameters, which is lower than the total number of parameters they contain.
Inputs and Outputs
Input:
- Text string, such as a question, a prompt, or a document to be summarized
- Images, normalized to 256x256, 512x512, or 768x768 resolution and encoded to 256 tokens each
- Audio data encoded to 6.25 tokens per second from a single channel
- Total input context of 32K tokens
Output:
- Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
- Total output length of up to 32K tokens, minus the tokens consumed by the request input
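The input and output budgets above can be sketched as a quick calculation. This is an illustrative helper, not an official API; it assumes the figures given here (256 tokens per image, 6.25 tokens per audio second) and takes "32K" to mean 32,768 tokens.

```python
# Illustrative token-budget sketch using the figures stated above.
# Function names and the 32,768 interpretation of "32K" are assumptions.

IMAGE_TOKENS = 256           # each image encodes to 256 tokens
AUDIO_TOKENS_PER_SEC = 6.25  # single-channel audio
CONTEXT_TOKENS = 32_768      # total input context ("32K")

def input_tokens(text_tokens: int, num_images: int = 0, audio_seconds: float = 0.0) -> int:
    """Estimate how many context tokens a request consumes."""
    total = text_tokens
    total += num_images * IMAGE_TOKENS
    total += int(audio_seconds * AUDIO_TOKENS_PER_SEC)
    return total

def remaining_output_budget(text_tokens: int, num_images: int = 0, audio_seconds: float = 0.0) -> int:
    """Tokens left for generated output after the input is accounted for."""
    return max(0, CONTEXT_TOKENS - input_tokens(text_tokens, num_images, audio_seconds))

# Example: a 500-token prompt with two images and 60 seconds of audio.
used = input_tokens(500, num_images=2, audio_seconds=60)
print(used)  # 500 + 2*256 + 60*6.25 = 1387
print(remaining_output_budget(500, num_images=2, audio_seconds=60))  # 32768 - 1387 = 31381
```

The point of the sketch is that images and audio consume the same shared context as text, so long multimodal inputs directly shrink the available output budget.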
Training Dataset
These models were trained on a dataset that includes a wide variety of sources totaling approximately 11 trillion tokens. The knowledge cutoff date for the training data was June 2024.
Key components:
- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning, work with symbolic representations, and address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.
- Audio: A diverse set of sound samples enables the model to recognize speech, transcribe text from recordings, and identify information in audio data.
Data Preprocessing
Key data cleaning and filtering methods applied to the training data:
- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with our policies.
Implementation Information
Hardware
Gemma was trained using Tensor Processing Unit (TPU) hardware (TPUv4p, TPUv5p, and TPUv5e). Training generative models requires significant computational power. TPUs offer several advantages:
- Performance: Specifically designed to handle the massive computations involved in training generative models
- Memory: Large amounts of high-bandwidth memory for handling large models and batch sizes
- Scalability: TPU Pods provide scalable solutions for handling growing complexity
- Cost-effectiveness: A more cost-effective solution than CPU-based infrastructure
Software
Training was done using JAX and ML Pathways. JAX allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models.
Benchmark Results

| Benchmark | Metric | n-shot | E2B PT | E4B PT |
|---|---|---|---|---|
| HellaSwag | Accuracy | 10-shot | 72.2 | 78.6 |
| BoolQ | Accuracy | 0-shot | 76.4 | 81.6 |
| PIQA | Accuracy | 0-shot | 78.9 | 81.0 |
| SocialIQA | Accuracy | 0-shot | 48.8 | 50.0 |
| TriviaQA | Accuracy | 5-shot | 60.8 | 70.2 |
| Natural Questions | Accuracy | 5-shot | 15.5 | 20.9 |
| ARC-c | Accuracy | 25-shot | 51.7 | 61.6 |
| ARC-e | Accuracy | 0-shot | 75.8 | 81.6 |
| WinoGrande | Accuracy | 5-shot | 66.8 | 71.7 |
| BIG-Bench Hard | Accuracy | few-shot | 44.3 | 52.9 |
| DROP | Token F1 score | 1-shot | 53.9 | 60.8 |
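To compare the two sizes, the benchmark scores above can be dropped into a short script that computes the absolute E4B-over-E2B improvement per benchmark. The numbers are copied verbatim from the table; the script itself is only illustrative.

```python
# Benchmark scores from the table above: {name: (E2B PT, E4B PT)}.
SCORES = {
    "HellaSwag":         (72.2, 78.6),
    "BoolQ":             (76.4, 81.6),
    "PIQA":              (78.9, 81.0),
    "SocialIQA":         (48.8, 50.0),
    "TriviaQA":          (60.8, 70.2),
    "Natural Questions": (15.5, 20.9),
    "ARC-c":             (51.7, 61.6),
    "ARC-e":             (75.8, 81.6),
    "WinoGrande":        (66.8, 71.7),
    "BIG-Bench Hard":    (44.3, 52.9),
    "DROP":              (53.9, 60.8),
}

def deltas(scores):
    """Absolute E4B - E2B improvement per benchmark, rounded to one decimal."""
    return {name: round(e4b - e2b, 1) for name, (e2b, e4b) in scores.items()}

d = deltas(SCORES)
largest = max(d, key=d.get)
print(largest, d[largest])  # ARC-c 9.9
```

On these numbers, E4B improves on E2B across every benchmark, with the largest gains on knowledge- and reasoning-heavy tasks (ARC-c, TriviaQA, BIG-Bench Hard).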
Intended Usage
Content Creation and Communication
- Text Generation: Generate creative text formats such as poems, scripts, code, marketing copy, and email drafts
- Chatbots and Conversational AI: Power conversational interfaces for customer service, virtual assistants, or interactive applications
- Text Summarization: Generate concise summaries of a text corpus, research papers, or reports
- Image Data Extraction: Extract, interpret, and summarize visual data for text communications
- Audio Data Extraction: Transcribe spoken language, translate speech to text in other languages, and analyze sound-based data
Research and Education
- NLP Research: These models can serve as a foundation for researchers to experiment with generative models and NLP techniques
- Language Learning Tools: Support interactive language learning experiences, aiding in grammar correction or providing writing practice
- Knowledge Exploration: Assist researchers in exploring large bodies of data by generating summaries or answering questions about specific topics
Limitations
- Training Data: The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
- Context and Task Complexity: Models are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
- Language Ambiguity and Nuance: Natural language is inherently complex. Models might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy: Models generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
- Common Sense: Models rely on statistical patterns in language. They might lack the ability to apply common sense reasoning in certain situations.