3,341 1 month ago

Gemma 4 26B MoE quantized by BatiAI. 77 t/s on M4 Max. Requires 24GB+ Mac.

tools thinking
ollama run batiai/gemma4-26b:q6

Details

1 month ago

2226bf6ca3ca · 23GB ·

gemma4
·
25.2B
·
Q6_K
You are a helpful AI assistant.
{ "num_ctx": 131072, "stop": [ "<turn|>" ], "temperature": 0.7 }

Readme

Gemma 4 26B-A4B-it — Quantized by BatiAI

Quantized directly from official Google BF16 weights. MoE design: 26 B total parameters, ~3.8 B active per token. Text-only here on Ollama; multimodal (vision: image + video) opt-in via HF + llama.cpp (see bottom).

Models

Tag Size VRAM M4 Pro (48GB) M4 Max (128GB) Use Case
iq4 13GB 15GB 58–63 t/s 85.8 t/s 24GB+ Mac, recommended
iq3 12GB 14GB 77 t/s 24GB Mac, slightly smaller
q3 13GB 15GB 70.7 t/s 24GB Mac, standard
q4 16GB 18GB 74.9 t/s 32GB+ Mac
q6 21GB 24GB 48–50 t/s 74.8 t/s 36GB+ Mac, highest quality

Quick Start

ollama run batiai/gemma4-26b:iq4

Why IQ4? — Fastest AND Smartest

IQ4 uses importance-matrix quantization: calibration data tells which weights matter most, compressing aggressively where it doesn’t matter.

IQ4_XS (BatiAI) Q4_K_M (standard)
Size 13GB 16GB
Speed (M4 Pro 48GB) 58–63 t/s
Speed (M4 Max 128GB) 85.8 t/s 74.9 t/s
Quality 4-bit imatrix 4-bit standard

Same 4-bit quality, 3GB smaller file. Verified with translation, tool calling, and math reasoning — identical output quality.

M4 Pro 48GB — Real User Benchmark

Measured on real Mac hardware (M4 Pro, 48GB unified memory):

Model Size VRAM Speed Cold start System free
BatiAI 26B IQ4 13GB 15.1GB 58–63 t/s 1.7s 58%
BatiAI 26B Q6 21GB 23.9GB 48–50 t/s 5.8s 40%
Ollama 26B (official) 14GB 19.3GB 56 t/s 3.4s 50%
31B IQ4 (Dense) 16GB 26.1GB 13.5 t/s 40s 37%

Key findings on 48GB Mac: - BatiAI IQ4 is faster than Ollama’s official 26B (58-63 vs 56 t/s) - 4x faster than 31B Dense with similar quality - Fastest cold start (1.7s) — imatrix 4-bit loads cleanest on Apple Silicon - Most system memory free (58%) — best for multitasking

Why IQ4 beats IQ3 on Apple Silicon

Counter-intuitively, IQ4 (13GB) is faster than IQ3 (12GB) on M-series chips:

  • 4-bit alignment — CPU/GPU processes 4-bit cleanly, SIMD-friendly
  • 3-bit packing — misaligned, complex lookup tables, SIMD inefficient
  • Memory read savings < dequantize overhead → IQ4 wins

Smaller file ≠ faster on Apple Silicon when it comes to 3-bit vs 4-bit.

RAM Requirements — Be Honest

Your Mac RAM IQ3 (12GB) IQ4 (13GB) Q3 (13GB) Q4 (16GB) Q6 (21GB)
16GB ❌ swap ❌ swap ❌ swap ❌ Won’t fit ❌ Won’t fit
24GB ✅ Fast ✅ Fits ⚠️ Tight ❌ Barely ❌ No
32GB ✅ Fast ✅ Fast ✅ Fast ✅ OK ❌ No
36GB+ ✅ Fast ✅ Fast ✅ Fast ✅ Fast ✅ Fits
128GB 77 t/s 85.8 t/s 70.7 t/s 74.9 t/s 74.8 t/s

16GB Mac Users

26B models don’t work on 16GB Mac. Use these instead:

ollama run batiai/gemma4-e4b    # 57.1 t/s on 16GB Mac ✅
ollama run batiai/qwen3.5-9b    # 12.5 t/s on 16GB Mac ✅

Why BatiAI?

  • Quantized directly from official Google weights (not third-party)
  • imatrix optimized (IQ3, IQ4) for best quality at each size
  • Third-party GGUFs (unsloth) fail on Ollama 0.20+ — ours work
  • Verified on Mac mini M4 (16GB) + MacBook Pro M4 Max (128GB)
  • Vision: mmproj available on HuggingFace (Ollama vision pending ecosystem fix)
  • Korean, tool calling, JSON generation all tested

Built for BatiFlow

Free, on-device AI automation for Mac. 5MB app, 100% local, unlimited.

https://flow.bati.ai

Multimodal mode (opt-in, HF + llama.cpp)

This Ollama tag is text-only — Ollama’s mmproj integration is still rough today. For image / video understanding, grab the main GGUF + the vision projector from HF and run with llama.cpp:

# Main model + vision projector
wget https://huggingface.co/batiai/Gemma-4-26B-A4B-it-GGUF/resolve/main/google-gemma-4-26B-A4B-it-IQ4_XS.gguf
wget https://huggingface.co/batiai/Gemma-4-26B-A4B-it-GGUF/resolve/main/mmproj-Q6_K.gguf

llama-server -m google-gemma-4-26B-A4B-it-IQ4_XS.gguf \
  --mmproj mmproj-Q6_K.gguf -c 32768 --port 8080

Audio is NOT supported in 26B/31B (vision only). For audio, use batiai/gemma4-e2b or batiai/gemma4-e4b.