61 Downloads Updated 3 days ago
ollama run odytrice/gemma4-26b:5090
Updated 3 days ago
3 days ago
15ebceb916f3 · 18GB ·
Gemma 4 26B-A4B (MoE, ~26B total / 4B active), vision + native tool calling.
Shared model card for odytrice/gemma4-26b:4090 and odytrice/gemma4-26b:5090.
Ollama’s registry shares the description across tags of the same model name,
so both GPU profiles live under this one card.
| Field | Value |
|---|---|
| Upstream | google/gemma-4-26B-A4B-it |
| NVFP4 source | nvidia/Gemma-4-26B-A4B-NVFP4 |
| Family | Gemma 4 (Google) |
| Architecture | Mixture-of-Experts (A4B) |
| Total / Active params | ~26B / 4B |
| Modalities | Text + Image (vision) |
| Languages | 140+ |
| Tool calling | Native (structured JSON) |
| Native context | 256K |
| License | Gemma Terms of Use |
| Tag | GPU | Quantization | KV cache | num_ctx |
|---|---|---|---|---|
odytrice/gemma4-26b:4090 |
RTX 4090 (24 GB Ada) | Q4_K_M (~17 GB) | q4_0 | 262144 |
odytrice/gemma4-26b:5090 |
RTX 5090 (32 GB Blackwell) | Q4_K_M (~17 GB), NVFP4 future | q8_0 | 262144 |
262144 (256K) is the model’s native window. The MoE architecture with only ~4B active params leaves enough KV cache headroom for full native context on both tiers: the 4090 at q4_0 KV cache and the 5090 at q8_0.
Always set these before running Ollama:
set OLLAMA_KV_CACHE_TYPE=q4_0 # Windows
set OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q4_0 # Linux/macOS
export OLLAMA_FLASH_ATTENTION=1
Gemma 4 sampling differs from the Qwen-style defaults used elsewhere in this repo:
temperature 1.0
top_p 0.95
top_k 64
Set via /set parameter inside ollama run or pass as request options
from your client (OpenCode, Aider, etc.). Not baked into the Modelfiles.
ollama ps;
no FP4 tensor-core acceleration on Ada