
Gemma4-nano is part of the G4Turbo.com family, an effort to bring the Gemma 4 model to everyone. Please visit https://g4turbo.com/ for more information about what I am doing.


gemma4-nano

Ultra-compressed Gemma 4 models optimized for mobile and edge devices

Part of the gemma4-turbo family, gemma4-nano uses Q3_K_S quantization to achieve a 50-57% size reduction compared to stock Gemma 4 models while maintaining quality and delivering faster inference.

🚀 Quick Start

# Run the latest nano model (e4b, 4.7 GB)
ollama run ssfdre38/gemma4-nano

# Or specify a size
ollama run ssfdre38/gemma4-nano:e2b  # 3.1 GB - fits 4GB RAM devices
ollama run ssfdre38/gemma4-nano:e4b  # 4.7 GB - best balance
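
The standard Ollama CLI also accepts a prompt as an argument for one-shot, scriptable use (the prompt below is just an example):

# one-shot generation, no interactive session
ollama run ssfdre38/gemma4-nano:e2b "Explain 3-bit quantization in two sentences."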

📊 Model Sizes

| Model | Original | Turbo (IQ4_XS) | Nano (Q3_K_S) | Reduction |
|-------|----------|----------------|---------------|-----------|
| e2b   | 7.2 GB   | 4.3 GB         | 3.1 GB        | -57%      |
| e4b   | 9.6 GB   | 6.1 GB         | 4.7 GB        | -51%      |
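
After pulling, you can confirm the on-disk size of each tag with Ollama's standard listing command:

# show installed models and their sizes
ollama list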

⚡ Performance Benchmarks

Tested on CPU (AMD Xeon, 8 threads):

e2b: Nano vs Turbo

| Prompt Type   | Turbo (IQ4_XS) | Nano (Q3_K_S) | Speedup |
|---------------|----------------|---------------|---------|
| Short prompts | 12.2 tok/s     | 13.6 tok/s    | 1.12x   |
| Reasoning     | 16.9 tok/s     | 19.4 tok/s    | 1.14x   |
| Code          | 16.7 tok/s     | 19.0 tok/s    | 1.13x   |
| Average       | 15.3 tok/s     | 17.3 tok/s    | 1.13x   |

Nano is 13% faster than turbo while being 28% smaller!
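
To check these numbers on your own hardware, Ollama's --verbose flag prints timing statistics (including the eval rate in tokens per second) after each response:

# print token throughput and timing after the response
ollama run ssfdre38/gemma4-nano:e2b --verbose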

🎯 Use Cases

  • Mobile & Edge: 3.1 GB e2b fits devices with 4GB RAM (leaves ~800MB for OS)
  • Offline-first apps: Smaller downloads, faster startup, lower bandwidth
  • IoT & embedded: Run full reasoning models on constrained hardware
  • Battery-sensitive: Less data movement = better power efficiency
  • Quick prototyping: Fast downloads and inference for rapid iteration

🧠 Features Preserved

✅ Full thinking/reasoning capability intact - same architecture as Gemma 4
✅ 16K context window - no context reduction
✅ Temperature, top-k, top-p controls - all sampling options available (see the API example below)
✅ Text-only optimized - no vision encoder bloat (saves ~1GB per model)
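
As a sketch of exercising those sampling controls, here is a request against Ollama's standard REST API; the prompt and option values are illustrative, not recommendations:

# query the local Ollama server, overriding sampling options per request
curl http://localhost:11434/api/generate -d '{
  "model": "ssfdre38/gemma4-nano:e2b",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": { "temperature": 0.7, "top_k": 40, "top_p": 0.9 }
}'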

🔧 Technical Details

Quantization Strategy

  • Method: Q3_K_S (3-bit k-quant, small variant)
  • Source: BF16 weights from bartowski (quantized directly from BF16, never re-quantized) - see the sketch below
  • Bits per weight: ~3.41 bpw
  • Why Q3_K_S over IQ3_M? Benchmarks showed Q3_K_S is 13% faster with minimal quality loss
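
A minimal sketch of how a quant like this is produced with llama.cpp's llama-quantize tool, assuming a BF16 GGUF on disk (the filenames here are hypothetical):

# quantize a BF16 GGUF down to Q3_K_S; the last argument selects the quant type
llama-quantize gemma4-e2b-bf16.gguf gemma4-e2b-Q3_K_S.gguf Q3_K_S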

Quality vs Turbo

Nano uses more aggressive quantization (3-bit vs 4-bit) but maintains:

  • Coherent multi-step reasoning
  • Accurate factual responses
  • Clean code generation
  • Proper markdown/formatting

Trade-off: slightly lower precision on edge cases, but the vast majority of use cases see no noticeable degradation.

📦 Model Configuration

Default Modelfile settings:

PARAMETER num_thread 8
PARAMETER num_batch 512
PARAMETER num_ctx 16384
PARAMETER temperature 1.0
PARAMETER top_k 64
PARAMETER top_p 0.95

Tune num_thread based on your CPU core count for best performance.
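
To bake different defaults into a local variant, you can layer a custom Modelfile on top of the published model; the name my-nano and the parameter values below are illustrative, not recommendations:

# Modelfile: inherit gemma4-nano and override runtime parameters
FROM ssfdre38/gemma4-nano:e2b
PARAMETER num_thread 16
PARAMETER num_ctx 8192

Then build and run the variant:

ollama create my-nano -f Modelfile
ollama run my-nano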

🔗 Related Models

gemma4-turbo family:

  • ssfdre38/gemma4-turbo - 40% smaller, multimodal (vision + text)
  • ssfdre38/gemma4-nano - 50-57% smaller, text-only, faster inference

Choose nano when:

  • RAM is constrained (<8GB)
  • Download size matters
  • Inference speed is critical
  • Vision capability is not needed

Choose turbo when:

  • You need vision (image understanding)
  • More RAM is available (8GB+)
  • You want the best quality at a compressed size

๐Ÿ“ License

Apache 2.0 - same as Gemma 4 base models

๐Ÿ™ Credits

  • Google DeepMind - Gemma 4 base models
  • bartowski - BF16 GGUF conversions
  • llama.cpp team - Quantization tools
  • Ollama team - Model hosting and runtime

Built with 🦞 for the Gemma 4 Good Hackathon 2026