Gemma 4 31B-it — Quantized by BatiAI

Quantized directly from official Google BF16 weights. Dense 31 B — every parameter active per token (denser computation than 26B-A4B’s MoE). Text-only here on Ollama; multimodal (vision: image + video) opt-in via HF + llama.cpp (see bottom).

Models

Tag	Size	VRAM	M4 Pro (48GB)	M4 Max (128GB)	Use Case
iq4 (recommended)	16GB	26GB	13.5 t/s	22.8 t/s	48GB+ Mac, best speed+quality
iq3	13GB	~24GB	12.2 t/s	20.7 t/s	48GB+ Mac, slightly smaller
q4	17GB	~27GB	—	19.1 t/s	48GB+ Mac, standard
q6	23GB	~32GB	❌ tight	6.6 t/s	64GB+ Mac only

Quick Start

ollama run batiai/gemma4-31b:iq4

Why IQ4_XS is Best

Same as 26B — imatrix optimization makes IQ4 both smaller and faster than Q4_K_M:

	IQ4_XS	Q4_K_M
Size	16GB	17GB
VRAM	41GB	43GB
Speed	22.8 t/s	19.1 t/s
Quality	4-bit imatrix	4-bit standard

RAM Requirements — Honest

Your Mac RAM	IQ3 (13GB)	IQ4 (16GB)	Q4 (17GB)	Q6 (23GB)
16GB	❌	❌	❌	❌
32GB	❌ swap	❌ swap	❌ swap	❌
48GB	12.2 t/s	13.5 t/s ✅	⚠️ tight	❌
64GB	✅ Fast	✅ Fast	✅ Fast	⚠️ Tight
128GB	20.7 t/s	22.8 t/s	19.1 t/s	6.6 t/s*

*Q6_K on 128GB Mac runs slow due to memory bandwidth limits, not VRAM.

31B vs 26B on M4 Pro 48GB — Real Numbers

We measured both models on the same 48GB Mac:

Metric	31B IQ4	26B IQ4 (MoE)
Speed	13.5 t/s	58–63 t/s (4x faster)
VRAM	26.1 GB (37% free)	15.1 GB (58% free)
Cold start	40 seconds	1.7 seconds
Simple response	1.5s	0.4s
Coding task	28.5s	6.8s

26B MoE wins on every axis for 48GB Mac. Use 31B only if you specifically need its higher quality on complex reasoning tasks (and have 64GB+ for comfortable headroom).

Smaller Macs — Use These Instead

# 16GB Mac
ollama run batiai/gemma4-e4b:q4     # 57.1 t/s, 10GB VRAM

# 24~48GB Mac (recommended for most users)
ollama run batiai/gemma4-26b:iq4    # 58-63 t/s on 48GB Mac, MoE architecture

Why BatiAI?

Quantized directly from official Google weights (not third-party)
imatrix optimized (IQ3, IQ4) for best quality at each size
Third-party GGUFs (unsloth) fail on Ollama 0.20+ — ours work
Verified on MacBook Pro M4 Max (128GB)
Korean, tool calling, JSON generation all tested

Built for BatiFlow

Free, on-device AI automation for Mac. 5MB app, 100% local, unlimited.

https://flow.bati.ai

Multimodal mode (opt-in, HF + llama.cpp)

This Ollama tag is text-only — Ollama’s mmproj integration is still rough today. For image / video understanding, grab the main GGUF + the vision projector from HF and run with llama.cpp:

wget https://huggingface.co/batiai/Gemma-4-31B-it-GGUF/resolve/main/google-gemma-4-31B-it-IQ4_XS.gguf
wget https://huggingface.co/batiai/Gemma-4-31B-it-GGUF/resolve/main/mmproj-Q6_K.gguf

llama-server -m google-gemma-4-31B-it-IQ4_XS.gguf \
  --mmproj mmproj-Q6_K.gguf -c 32768 --port 8080

Audio is NOT supported in 26B/31B (vision only). For audio, use batiai/gemma4-e2b or batiai/gemma4-e4b.

Dense vs MoE — when to pick 31B vs 26B-A4B?

	31B-it (dense)	26B-A4B-it (MoE)
Active params/token	31 B	3.8 B
Throughput	slower	faster
Reasoning depth	deeper per token	good
Best for	hard reasoning	high-throughput / agents

Details

Readme