20 3 days ago

Gemma 4 31B dense, vision + native tool calling.

vision tools thinking
ollama run odytrice/gemma4-31b:5090

Details

3 days ago

9a290d5cccce · 20GB ·

gemma4
·
31.3B
·
Q4_K_M
Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR US
{ "num_ctx": 153600, "num_gpu": 999, "temperature": 1, "top_k": 64, "top_p": 0.9

Readme

Gemma 4 31B

Gemma 4 31B dense, vision + native tool calling.

Model card for odytrice/gemma4-31b:5090. The dense 31B at Q4_K_M (~19 GB) does not leave usable KV cache headroom on a 24 GB 4090, so only a 5090 profile is provided.

Upstream

Field Value
Upstream google/gemma-4-31B-it
NVFP4 source nvidia/Gemma-4-31B-IT-NVFP4
Family Gemma 4 (Google)
Architecture Dense
Params ~31B (33B on HF card)
Modalities Text + Image (vision)
Languages 140+
Tool calling Native (structured JSON)
Native context 256K
License Gemma Terms of Use

Tags

Tag GPU Quantization KV cache num_ctx
odytrice/gemma4-31b:5090 RTX 5090 (32 GB Blackwell) Q4_K_M (~19 GB), NVFP4 future q8_0 153600

Why this context size

153600 mirrors the gateway config. 32 GB holds the ~19 GB weights plus q8_0 KV cache for ~150K context with overhead. Well within the model’s native 256K window - no YaRN scaling needed.

If ollama ps shows CPU% on the 4090 tag: drop num_ctx to 32K or switch KV cache to q4_0.

Environment

Always set these before running Ollama:

set OLLAMA_KV_CACHE_TYPE=q4_0    # Windows
set OLLAMA_FLASH_ATTENTION=1

export OLLAMA_KV_CACHE_TYPE=q4_0   # Linux/macOS
export OLLAMA_FLASH_ATTENTION=1

Sampling

temperature   1.0
top_p         0.95
top_k         64

Set via /set parameter or pass from your client.

Strengths

  • Best reasoning in the Gemma 4 family (MMLU Pro, AIME, Codeforces leader)
  • Native vision + native tool calling
  • 140+ languages
  • Gemma Terms permit commercial use

Caveats

  • Dense ~31B is slower per token than the A4B MoE 26B variant
  • NVFP4 weights exist upstream but Ollama does not yet load them

See also