418.9K Downloads · Updated 2 months ago
15cb39fd9394 · 7.5GB
Gemma 3n models are designed for efficient execution on everyday devices such as laptops, tablets, or phones. They were trained on data covering over 140 spoken languages.
Gemma 3n models use selective parameter activation to reduce resource requirements. This technique lets the models operate at effective sizes of 2B (E2B) and 4B (E4B) parameters, lower than the total number of parameters they contain.
ollama run gemma3n:e2b
ollama run gemma3n:e4b
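Beyond the CLI, the models can be called programmatically once they are pulled. The sketch below is illustrative only: it assumes the community `ollama` Python package is installed (`pip install ollama`) and that an Ollama server is running locally on the default port; the prompt text is an arbitrary example.

```python
# Minimal sketch: chat with the E2B variant through the Ollama Python client.
# Assumes `pip install ollama` and a local Ollama server with gemma3n:e2b pulled.
import ollama

response = ollama.chat(
    model="gemma3n:e2b",
    messages=[
        {"role": "user", "content": "Summarize why on-device models matter in two sentences."},
    ],
)

# The reply text lives under the returned message content.
print(response["message"]["content"])
```

Swapping `gemma3n:e2b` for `gemma3n:e4b` selects the larger effective-4B variant; everything else stays the same.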
Model evaluation metrics and results.
These models were evaluated at full precision (float32) against a large collection of datasets and metrics covering different aspects of content generation. Results marked IT are for instruction-tuned models; results marked PT are for pre-trained models. The models available on Ollama are the instruction-tuned variants.
Benchmark | Metric | n-shot | E2B PT | E4B PT |
---|---|---|---|---|
HellaSwag | Accuracy | 10-shot | 72.2 | 78.6 |
BoolQ | Accuracy | 0-shot | 76.4 | 81.6 |
PIQA | Accuracy | 0-shot | 78.9 | 81.0 |
SocialIQA | Accuracy | 0-shot | 48.8 | 50.0 |
TriviaQA | Accuracy | 5-shot | 60.8 | 70.2 |
Natural Questions | Accuracy | 5-shot | 15.5 | 20.9 |
ARC-c | Accuracy | 25-shot | 51.7 | 61.6 |
ARC-e | Accuracy | 0-shot | 75.8 | 81.6 |
WinoGrande | Accuracy | 5-shot | 66.8 | 71.7 |
BIG-Bench Hard | Accuracy | few-shot | 44.3 | 52.9 |
DROP | Token F1 score | 1-shot | 53.9 | 60.8 |
Benchmark | Metric | n-shot | E2B IT | E4B IT |
---|---|---|---|---|
MGSM | Accuracy | 0-shot | 53.1 | 60.7 |
WMT24++ (ChrF) | Character-level F-score | 0-shot | 42.7 | 50.1 |
Include | Accuracy | 0-shot | 38.6 | 57.2 |
MMLU (ProX) | Accuracy | 0-shot | 8.1 | 19.9 |
OpenAI MMLU | Accuracy | 0-shot | 22.3 | 35.6 |
Global-MMLU | Accuracy | 0-shot | 55.1 | 60.3 |
ECLeKTic | ECLeKTic score | 0-shot | 2.5 | 1.9 |
Benchmark | Metric | n-shot | E2B IT | E4B IT |
---|---|---|---|---|
GPQA Diamond | Relaxed accuracy / accuracy | 0-shot | 24.8 | 23.7 |
LiveCodeBench v5 | pass@1 | 0-shot | 18.6 | 25.7 |
Codegolf v2.2 | pass@1 | 0-shot | 11.0 | 16.8 |
AIME 2025 | Accuracy | 0-shot | 6.7 | 11.6 |
Benchmark | Metric | n-shot | E2B IT | E4B IT |
---|---|---|---|---|
MMLU | Accuracy | 0-shot | 60.1 | 64.9 |
MBPP | pass@1 | 3-shot | 56.6 | 63.6 |
HumanEval | pass@1 | 0-shot | 66.5 | 75.0 |
LiveCodeBench | pass@1 | 0-shot | 13.2 | 13.2 |
HiddenMath | Accuracy | 0-shot | 27.7 | 37.7 |
Global-MMLU-Lite | Accuracy | 0-shot | 59.0 | 64.5 |
MMLU (Pro) | Accuracy | 0-shot | 40.5 | 50.6 |
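The tables above report n-shot accuracy (or pass@1) from Google's own evaluation harness. As a rough illustration of what a 0-shot multiple-choice accuracy check looks like against the locally served instruction-tuned model, here is a minimal sketch. The sample question, prompt format, and the naive answer-extraction rule are assumptions made for illustration; they are not the harness or prompts used to produce the numbers above.

```python
# Illustrative 0-shot multiple-choice accuracy loop against a local Ollama model.
# The question list and the "first A-D letter" parsing are simplifying assumptions;
# real benchmark harnesses are far more careful about prompting and scoring.
import ollama

QUESTIONS = [  # hypothetical MMLU-style items, for demonstration only
    {
        "question": "Which planet is known as the Red Planet?",
        "choices": {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Mercury"},
        "answer": "B",
    },
]

def ask(model: str, item: dict) -> str:
    options = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
    prompt = (
        f"{item['question']}\n{options}\n"
        "Answer with the single letter of the correct choice."
    )
    reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    text = reply["message"]["content"].strip().upper()
    # Naive extraction: take the first choice letter that appears in the reply.
    return next((c for c in text if c in item["choices"]), "")

correct = sum(ask("gemma3n:e4b", item) == item["answer"] for item in QUESTIONS)
print(f"accuracy: {correct / len(QUESTIONS):.2%}")
```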
These models have certain limitations that users should be aware of.
Open generative models have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive; its purpose is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.
Ethics and safety evaluation approach and results.
Our evaluation methods include structured evaluations and internal red-teaming of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against several categories relevant to ethics and safety, including child safety, content safety, and representational harms.
In addition to development-level evaluations, we conduct “assurance evaluations”, our arms-length internal evaluations for responsibility governance decision making. They are conducted separately from the model development team to inform decisions about release. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and to preserve the results’ ability to inform decision making. Notable assurance evaluation results are reported to our Responsibility & Safety Council as part of release review.
For all areas of safety testing, we saw safe levels of performance across the categories of child safety, content safety, and representational harms relative to previous Gemma models. All testing was conducted without safety filters to evaluate the models’ capabilities and behaviors. For text-to-text, image-to-text, and audio-to-text, across all model sizes, the models produced minimal policy violations and showed significant improvements over previous Gemma models with respect to high-severity violations. A limitation of our evaluations was that they included primarily English-language prompts.