19 3 days ago

Gemma 4 31B dense, vision + native tool calling.

vision tools thinking
ollama run odytrice/gemma4-31b:5090

Applications

Claude Code
Claude Code ollama launch claude --model odytrice/gemma4-31b:5090
Codex App
Codex App ollama launch codex-app --model odytrice/gemma4-31b:5090
OpenClaw
OpenClaw ollama launch openclaw --model odytrice/gemma4-31b:5090
Hermes Agent
Hermes Agent ollama launch hermes --model odytrice/gemma4-31b:5090
Codex
Codex ollama launch codex --model odytrice/gemma4-31b:5090
OpenCode
OpenCode ollama launch opencode --model odytrice/gemma4-31b:5090

Models

View all →

1 model

gemma4-31b:5090

20GB · 256K context window · Text, Image · 3 days ago

Readme

Gemma 4 31B

Gemma 4 31B dense, vision + native tool calling.

Model card for odytrice/gemma4-31b:5090. The dense 31B at Q4_K_M (~19 GB) does not leave usable KV cache headroom on a 24 GB 4090, so only a 5090 profile is provided.

Upstream

Field Value
Upstream google/gemma-4-31B-it
NVFP4 source nvidia/Gemma-4-31B-IT-NVFP4
Family Gemma 4 (Google)
Architecture Dense
Params ~31B (33B on HF card)
Modalities Text + Image (vision)
Languages 140+
Tool calling Native (structured JSON)
Native context 256K
License Gemma Terms of Use

Tags

Tag GPU Quantization KV cache num_ctx
odytrice/gemma4-31b:5090 RTX 5090 (32 GB Blackwell) Q4_K_M (~19 GB), NVFP4 future q8_0 153600

Why this context size

153600 mirrors the gateway config. 32 GB holds the ~19 GB weights plus q8_0 KV cache for ~150K context with overhead. Well within the model’s native 256K window - no YaRN scaling needed.

If ollama ps shows CPU% on the 4090 tag: drop num_ctx to 32K or switch KV cache to q4_0.

Environment

Always set these before running Ollama:

set OLLAMA_KV_CACHE_TYPE=q4_0    # Windows
set OLLAMA_FLASH_ATTENTION=1

export OLLAMA_KV_CACHE_TYPE=q4_0   # Linux/macOS
export OLLAMA_FLASH_ATTENTION=1

Sampling

temperature   1.0
top_p         0.95
top_k         64

Set via /set parameter or pass from your client.

Strengths

  • Best reasoning in the Gemma 4 family (MMLU Pro, AIME, Codeforces leader)
  • Native vision + native tool calling
  • 140+ languages
  • Gemma Terms permit commercial use

Caveats

  • Dense ~31B is slower per token than the A4B MoE 26B variant
  • NVFP4 weights exist upstream but Ollama does not yet load them

See also