913 Downloads Updated 1 month ago
ollama run batiai/gemma4-31b:q4
Quantized directly from official Google BF16 weights. Dense 31 B — every parameter active per token (denser computation than 26B-A4B’s MoE). Text-only here on Ollama; multimodal (vision: image + video) opt-in via HF + llama.cpp (see bottom).
| Tag | Size | VRAM | M4 Pro (48GB) | M4 Max (128GB) | Use Case |
|---|---|---|---|---|---|
| iq4 (recommended) | 16GB | 26GB | 13.5 t/s | 22.8 t/s | 48GB+ Mac, best speed+quality |
| iq3 | 13GB | ~24GB | 12.2 t/s | 20.7 t/s | 48GB+ Mac, slightly smaller |
| q4 | 17GB | ~27GB | — | 19.1 t/s | 48GB+ Mac, standard |
| q6 | 23GB | ~32GB | ❌ tight | 6.6 t/s | 64GB+ Mac only |
ollama run batiai/gemma4-31b:iq4
Same as 26B — imatrix optimization makes IQ4 both smaller and faster than Q4_K_M:
| IQ4_XS | Q4_K_M | |
|---|---|---|
| Size | 16GB | 17GB |
| VRAM | 41GB | 43GB |
| Speed | 22.8 t/s | 19.1 t/s |
| Quality | 4-bit imatrix | 4-bit standard |
| Your Mac RAM | IQ3 (13GB) | IQ4 (16GB) | Q4 (17GB) | Q6 (23GB) |
|---|---|---|---|---|
| 16GB | ❌ | ❌ | ❌ | ❌ |
| 32GB | ❌ swap | ❌ swap | ❌ swap | ❌ |
| 48GB | 12.2 t/s | 13.5 t/s ✅ | ⚠️ tight | ❌ |
| 64GB | ✅ Fast | ✅ Fast | ✅ Fast | ⚠️ Tight |
| 128GB | 20.7 t/s | 22.8 t/s | 19.1 t/s | 6.6 t/s* |
*Q6_K on 128GB Mac runs slow due to memory bandwidth limits, not VRAM.
We measured both models on the same 48GB Mac:
| Metric | 31B IQ4 | 26B IQ4 (MoE) |
|---|---|---|
| Speed | 13.5 t/s | 58–63 t/s (4x faster) |
| VRAM | 26.1 GB (37% free) | 15.1 GB (58% free) |
| Cold start | 40 seconds | 1.7 seconds |
| Simple response | 1.5s | 0.4s |
| Coding task | 28.5s | 6.8s |
26B MoE wins on every axis for 48GB Mac. Use 31B only if you specifically need its higher quality on complex reasoning tasks (and have 64GB+ for comfortable headroom).
# 16GB Mac
ollama run batiai/gemma4-e4b:q4 # 57.1 t/s, 10GB VRAM
# 24~48GB Mac (recommended for most users)
ollama run batiai/gemma4-26b:iq4 # 58-63 t/s on 48GB Mac, MoE architecture
Free, on-device AI automation for Mac. 5MB app, 100% local, unlimited.
This Ollama tag is text-only — Ollama’s mmproj integration is still rough today. For image / video understanding, grab the main GGUF + the vision projector from HF and run with llama.cpp:
wget https://huggingface.co/batiai/Gemma-4-31B-it-GGUF/resolve/main/google-gemma-4-31B-it-IQ4_XS.gguf
wget https://huggingface.co/batiai/Gemma-4-31B-it-GGUF/resolve/main/mmproj-Q6_K.gguf
llama-server -m google-gemma-4-31B-it-IQ4_XS.gguf \
--mmproj mmproj-Q6_K.gguf -c 32768 --port 8080
Audio is NOT supported in 26B/31B (vision only). For audio, use batiai/gemma4-e2b or batiai/gemma4-e4b.
| 31B-it (dense) | 26B-A4B-it (MoE) | |
|---|---|---|
| Active params/token | 31 B | 3.8 B |
| Throughput | slower | faster |
| Reasoning depth | deeper per token | good |
| Best for | hard reasoning | high-throughput / agents |