odytrice/gemma4-31b:5090

20 Downloads · Updated 3 days ago

Gemma 4 31B dense, vision + native tool calling.

vision tools thinking

ollama run odytrice/gemma4-31b:5090

curl http://localhost:11434/api/chat \
  -d '{
    "model": "odytrice/gemma4-31b:5090",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

from ollama import chat

response = chat(
    model='odytrice/gemma4-31b:5090',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'odytrice/gemma4-31b:5090',
  messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)

Details

9a290d5cccce · 20GB

model (20GB): arch gemma4 · parameters 31.3B · quantization Q4_K_M
license (11kB): Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR US
params (73B): { "num_ctx": 153600, "num_gpu": 999, "temperature": 1, "top_k": 64, "top_p": 0.9

Readme

# Gemma 4 31B

> Gemma 4 31B dense, vision + native tool calling.

Model card for `odytrice/gemma4-31b:5090`. The dense 31B at Q4_K_M (~19 GB)
does not leave usable KV cache headroom on a 24 GB 4090, so only a 5090
profile is provided.

## Upstream

| Field | Value |
|---|---|
| Upstream | `google/gemma-4-31B-it` |
| NVFP4 source | `nvidia/Gemma-4-31B-IT-NVFP4` |
| Family | Gemma 4 (Google) |
| Architecture | Dense |
| Params | ~31B (33B on HF card) |
| Modalities | Text + Image (vision) |
| Languages | 140+ |
| Tool calling | Native (structured JSON) |
| Native context | 256K |
| License | Gemma Terms of Use |
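
The Modalities row means a chat message can carry an image alongside text. The sketch below shows one way to build such a request body, assuming Ollama's `/api/chat` convention of base64-encoded images in a message's `images` list. The helper name `build_vision_message` and the file name in the comment are hypothetical.

```python
import base64


def build_vision_message(prompt: str, image_bytes: bytes) -> dict:
    """Build one user message carrying a prompt and a single image.

    Assumes the Ollama chat API shape, where images are passed as
    base64 strings in the message's `images` list.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"role": "user", "content": prompt, "images": [encoded]}


# Stand-in bytes so the sketch runs without a file on disk; in practice
# read them with open("photo.png", "rb").read() (hypothetical file name).
message = build_vision_message("Describe this image.", b"\x89PNG placeholder")

request_body = {
    "model": "odytrice/gemma4-31b:5090",
    "messages": [message],
    "stream": False,
}
```

The resulting `request_body` can be sent as JSON to `http://localhost:11434/api/chat`, the same endpoint used by the curl example on the model page.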

## Tags

| Tag | GPU | Quantization | KV cache | `num_ctx` |
|---|---|---|---|---|
| `odytrice/gemma4-31b:5090` | RTX 5090 (32 GB Blackwell) | Q4_K_M (~19 GB), NVFP4 future | q8_0 | 153600 |

### Why this context size

153600 mirrors the gateway config. The 5090's 32 GB holds the ~19 GB of
weights plus a q8_0 KV cache for ~150K tokens of context, including overhead.
That is well within the model's native 256K window, so no YaRN scaling is
needed.
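
The memory argument above can be checked with a rough calculation. This is a sketch of the standard upper-bound KV cache formula, not a measurement. The layer count, KV head count, and head dimension below are hypothetical placeholders, so substitute the values from the model's actual config before drawing conclusions. Layers that use sliding-window attention cache fewer tokens, so real usage can be lower than this bound.

```python
# Bytes stored per cached value. q8_0 and q4_0 use GGML's block layouts:
# 32 values per block, packed into 34 and 18 bytes respectively.
BYTES_PER_VALUE = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Upper-bound KV cache size in GiB for one sequence.

    Assumes every layer caches keys and values for all n_ctx tokens.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return values * BYTES_PER_VALUE[cache_type] / 2**30


# HYPOTHETICAL shape for illustration only, not the real Gemma 4 31B config.
shape = {"n_layers": 48, "n_kv_heads": 4, "head_dim": 128}

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(153600, cache_type=cache_type, **shape)
    print(f"{cache_type}: {size:.2f} GiB")
```

Whatever the real shape is, the ratio between cache types is fixed: `q4_0` needs roughly half the memory of `q8_0` (18 versus 34 bytes per block), and `q8_0` roughly half of `f16`.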

If `ollama ps` shows a CPU share for the 5090 tag, part of the model has
spilled out of VRAM: drop `num_ctx` to 32K or switch the KV cache to `q4_0`.
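
The same check can be scripted. This sketch assumes the `/api/ps` endpoint of a local Ollama server, whose entries report a total `size` and a `size_vram` in bytes. The helper names and the sample numbers are made up for illustration.

```python
import json
import urllib.request


def cpu_share(entry: dict) -> float:
    """Fraction of a loaded model that is NOT resident in VRAM (0.0 = fully on GPU)."""
    size = entry["size"]
    if size <= 0:
        return 0.0
    return max(0.0, 1.0 - entry["size_vram"] / size)


def loaded_models(host: str = "http://localhost:11434") -> list:
    """Fetch the running-model list from a local Ollama server (not called here)."""
    with urllib.request.urlopen(f"{host}/api/ps") as resp:
        return json.load(resp)["models"]


# Made-up numbers for illustration: 30 GB in total, 24 GB of it in VRAM.
sample = {
    "name": "odytrice/gemma4-31b:5090",
    "size": 30_000_000_000,
    "size_vram": 24_000_000_000,
}

spilled = cpu_share(sample)
if spilled > 0:
    print(f"{sample['name']}: {spilled:.0%} on CPU, consider a smaller num_ctx")
```

Against a live server, replace `sample` with each entry returned by `loaded_models()`.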

## Environment

Set these in the Ollama server's environment before starting it. KV cache
quantization only takes effect with flash attention enabled:

```
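rem Windows (cmd.exe); keep comments on their own line, since cmd would
rem otherwise store a trailing "# ..." as part of the value
set OLLAMA_KV_CACHE_TYPE=q8_0
set OLLAMA_FLASH_ATTENTION=1

# Linux/macOS
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_FLASH_ATTENTION=1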
```
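
These variables are read by the server process, so setting them in a client script has no effect on a server that is already running. One way to scope them to a single server launch is to start `ollama serve` from a wrapper. The sketch below is illustrative only: `build_server_env` and `start_server` are hypothetical helper names, and the example call uses `q8_0`, the KV cache type listed in the Tags table.

```python
import os
import subprocess

# KV cache types accepted by OLLAMA_KV_CACHE_TYPE.
ALLOWED_KV_TYPES = {"f16", "q8_0", "q4_0"}


def build_server_env(kv_cache_type: str, base_env=None) -> dict:
    """Return a copy of the environment with the two Ollama variables set."""
    if kv_cache_type not in ALLOWED_KV_TYPES:
        raise ValueError(f"unsupported KV cache type: {kv_cache_type}")
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    env["OLLAMA_FLASH_ATTENTION"] = "1"
    return env


def start_server(kv_cache_type: str) -> subprocess.Popen:
    """Launch `ollama serve` with the variables applied (not called here)."""
    return subprocess.Popen(["ollama", "serve"], env=build_server_env(kv_cache_type))


# Build the environment without launching anything.
env = build_server_env("q8_0", base_env={"PATH": "/usr/bin"})
```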

## Sampling

```
temperature   1.0
top_p         0.95
top_k         64
```

Set these with `/set parameter` inside `ollama run`, or pass them as
`options` from your client.
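
For the client route, the sketch below passes the sampling values through the `options` field of an `/api/chat` request, assuming a local server on the default port. `build_chat_request` and `send` are hypothetical helper names, and only the standard library is used.

```python
import json
import urllib.request

# Sampling values from the block above.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def build_chat_request(prompt: str, **overrides) -> dict:
    """Build an /api/chat body; keyword overrides replace individual options."""
    return {
        "model": "odytrice/gemma4-31b:5090",
        "messages": [{"role": "user", "content": prompt}],
        "options": {**SAMPLING, **overrides},
        "stream": False,
    }


def send(body: dict, host: str = "http://localhost:11434") -> str:
    """POST the request to a local Ollama server (not called here)."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]


body = build_chat_request("Hello!")
```

Calling `send(body)` against a running server returns the assistant's reply text. Options sent this way apply to that request only.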

## Strengths

- Best reasoning in the Gemma 4 family (leads on MMLU Pro, AIME, and Codeforces)
- Native vision + native tool calling
- 140+ languages
- Gemma Terms permit commercial use
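
To make the tool-calling bullet concrete, here is a minimal sketch of the client side. It assumes the Ollama chat API shape: tools are declared as JSON-schema functions in the request's `tools` field, and the assistant message returns `tool_calls` whose `arguments` are already parsed into an object. The `get_weather` tool, the registry, and the stand-in assistant message are hypothetical, and the fields of the tool reply are kept to a minimum.

```python
def get_weather(city: str) -> str:
    """Hypothetical tool; returns a canned answer so the sketch runs offline."""
    return f"Sunny in {city}"


# Tool declaration to send in the `tools` field of a chat request.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Maps tool names the model may call to local Python functions.
REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message: dict) -> list:
    """Execute each tool call in an assistant message and build tool-role replies."""
    replies = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        result = REGISTRY[fn["name"]](**fn["arguments"])
        replies.append({"role": "tool", "content": result})
    return replies


# Hand-written stand-in for a model reply, not real model output.
assistant_message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [{"function": {"name": "get_weather", "arguments": {"city": "Lagos"}}}],
}

replies = run_tool_calls(assistant_message)
```

In a real exchange, the assistant message and the tool replies are appended to `messages` and sent back so the model can write its final answer.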

## Caveats

- Dense ~31B is slower per token than the A4B MoE 26B variant
- NVFP4 weights exist upstream but Ollama does not yet load them

## See also

- Gemma 4 26B A4B MoE card: the faster MoE sibling
- Hugging Face: https://huggingface.co/google/gemma-4-31B-it
- Hugging Face NVFP4: https://huggingface.co/nvidia/Gemma-4-31B-IT-NVFP4
- 32 GB tier guide at the repo root
