20 Downloads Updated 3 days ago
ollama run odytrice/gemma4-31b:5090
Updated 3 days ago
3 days ago
9a290d5cccce · 20GB ·
Gemma 4 31B dense, vision + native tool calling.
Model card for odytrice/gemma4-31b:5090. The dense 31B at Q4_K_M (~19 GB)
does not leave usable KV cache headroom on a 24 GB 4090, so only a 5090
profile is provided.
| Field | Value |
|---|---|
| Upstream | google/gemma-4-31B-it |
| NVFP4 source | nvidia/Gemma-4-31B-IT-NVFP4 |
| Family | Gemma 4 (Google) |
| Architecture | Dense |
| Params | ~31B (33B on HF card) |
| Modalities | Text + Image (vision) |
| Languages | 140+ |
| Tool calling | Native (structured JSON) |
| Native context | 256K |
| License | Gemma Terms of Use |
| Tag | GPU | Quantization | KV cache | num_ctx |
|---|---|---|---|---|
odytrice/gemma4-31b:5090 |
RTX 5090 (32 GB Blackwell) | Q4_K_M (~19 GB), NVFP4 future | q8_0 | 153600 |
153600 mirrors the gateway config. 32 GB holds the ~19 GB weights plus q8_0 KV cache for ~150K context with overhead. Well within the model’s native 256K window - no YaRN scaling needed.
If ollama ps shows CPU% on the 4090 tag: drop num_ctx to 32K or switch
KV cache to q4_0.
Always set these before running Ollama:
set OLLAMA_KV_CACHE_TYPE=q4_0 # Windows
set OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q4_0 # Linux/macOS
export OLLAMA_FLASH_ATTENTION=1
temperature 1.0
top_p 0.95
top_k 64
Set via /set parameter or pass from your client.