odytrice/gemma4:4090-12b

393 Downloads Updated 1 month ago

Gemma 4 Ollama profiles for RTX 4090/5090 across 12B, 26B-A4B, and 31B variants, with multimodal support and native tool calling

vision tools thinking

ollama run odytrice/gemma4:4090-12b

curl http://localhost:11434/api/chat \
  -d '{
    "model": "odytrice/gemma4:4090-12b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

from ollama import chat

response = chat(
    model='odytrice/gemma4:4090-12b',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'odytrice/gemma4:4090-12b',
  messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)

Details

Updated 1 month ago · dff1ee05a898 · 13GB

| Component | Details | Size |
|---|---|---|
| model | arch gemma4 · parameters 11.9B · quantization Q8_0 | 13GB |
| projector | arch clip · parameters 52.4M · quantization BF16 | 175MB |
| license | Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR US | 10kB |
| params | `{ "num_ctx": 262144, "num_gpu": 999, "temperature": 1, "top_k": 64, "top_p": 0.9` | 73B |

Readme

# Gemma 4

Gemma 4 model profiles for Ollama under the shared `odytrice/gemma4` model name.
Tags encode target GPU and parameter count as `<gpu>-<size>`.

## Tags

| Tag | GPU | Quantization | `num_ctx` |
|---|---|---|---|
| `odytrice/gemma4:4090-12b` | RTX 4090 (24 GB Ada) | Q8_0 (~12 GB) | 262144 |
| `odytrice/gemma4:5090-12b` | RTX 5090 (32 GB Blackwell) | BF16 (~24 GB) | 262144 |
| `odytrice/gemma4:4090-26b` | RTX 4090 (24 GB Ada) | Q4_K_M (~17 GB) | 131072 |
| `odytrice/gemma4:5090-26b` | RTX 5090 (32 GB Blackwell) | Q4_K_M (~17 GB) | 262144 |
| `odytrice/gemma4:5090-31b` | RTX 5090 (32 GB Blackwell) | Q4_K_M (~19 GB) | 153600 |

## Upstream

| Size | Upstream | Architecture | Modalities | Native context |
|---|---|---|---|---|
| 12B | `google/gemma-4-12B` / `google/gemma-4-12B-it` | Dense unified | Text + Image + Audio | 256K |
| 26B | `google/gemma-4-26B-A4B-it` | MoE A4B | Text + Image | 256K |
| 31B | `google/gemma-4-31B-it` | Dense | Text + Image | 256K |

## Environment

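For the 26B and 31B Q4 profiles, set KV cache quantization in the Ollama server's environment before starting it. The `set` lines are for Windows Command Prompt and the `export` lines are for Linux/macOS shells: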

```
set OLLAMA_KV_CACHE_TYPE=q4_0
set OLLAMA_FLASH_ATTENTION=1

export OLLAMA_KV_CACHE_TYPE=q4_0
export OLLAMA_FLASH_ATTENTION=1
```

For 12B profiles, flash attention is still recommended (Windows Command Prompt first, then Linux/macOS shells):

```
set OLLAMA_FLASH_ATTENTION=1

export OLLAMA_FLASH_ATTENTION=1
```
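
If the server is started from a script rather than a shell, the same settings can be passed through the process environment. This is a sketch under stated assumptions: `server_env` and `launch` are hypothetical helpers, and `launch` assumes the `ollama` binary is on `PATH` with no server already running.

```python
import os
import subprocess


def server_env(size: str) -> dict:
    """Copy the current environment and add the settings described above."""
    env = dict(os.environ)
    env["OLLAMA_FLASH_ATTENTION"] = "1"  # recommended for every profile
    if size in ("26b", "31b"):
        # KV cache quantization is only called for on the 26B and 31B Q4 profiles.
        env["OLLAMA_KV_CACHE_TYPE"] = "q4_0"
    return env


def launch(size: str) -> subprocess.Popen:
    """Start `ollama serve` with the environment for the given model size."""
    return subprocess.Popen(["ollama", "serve"], env=server_env(size))


# Example (not run here): server = launch("31b")
```

The variables have to reach the server process; setting them only in the shell that runs the client will not change how the model is loaded.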

## Sampling

Gemma 4 defaults from Ollama:

```
temperature   1.0
top_p         0.95
top_k         64
```

Set sampling parameters via `/set parameter` inside `ollama run`, or pass them as
request `options` from your client. Sampling is not baked into these Modelfiles.
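
Passing the values as request options from the Python client might look like the sketch below. The `sampling_options` and `ask` helpers are hypothetical; the `options` argument of `ollama.chat` is what carries the parameters. The `ollama` package is imported lazily so that the options helper can be used without it.

```python
# Defaults listed above.
GEMMA4_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def sampling_options(**overrides) -> dict:
    """Return the defaults above, with any per-request overrides applied."""
    return {**GEMMA4_SAMPLING, **overrides}


def ask(prompt: str, model: str = "odytrice/gemma4:4090-12b", **overrides) -> str:
    """Send one user message with explicit sampling options (needs a running server)."""
    from ollama import chat  # requires the `ollama` Python package

    response = chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options=sampling_options(**overrides),
    )
    return response.message.content


# Example (not run here): print(ask("Hello!", temperature=0.2))
```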

## Notes

The 5090 26B profile uses the known-good Ollama Q4_K_M artifact with a
262144-token context tuned for OpenCode and a q4_0 KV cache. The 31B profile
uses a 153600-token context to fit the dense model on a 32 GB 5090 while
staying inside the native 256K window. Direct HF NVFP4/GGUF imports of the
larger models have had loader compatibility issues on the remote Ollama 0.23.x
server.
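
The context sizes above are easier to compare in units of 1024 tokens. The sketch below assumes "256K" means 256 × 1024 = 262144 tokens, which matches the largest `num_ctx` in the Tags table; `ctx_in_k` is a hypothetical helper.

```python
# Assumption: the "256K" native window is 256 * 1024 tokens.
NATIVE_CTX = 256 * 1024

# num_ctx per profile, copied from the Tags table.
PROFILE_CTX = {
    "4090-12b": 262144,
    "5090-12b": 262144,
    "4090-26b": 131072,
    "5090-26b": 262144,
    "5090-31b": 153600,
}


def ctx_in_k(num_ctx: int) -> int:
    """Express a context length in units of 1024 tokens."""
    return num_ctx // 1024


for tag, num_ctx in PROFILE_CTX.items():
    # Every profile should stay inside the native window.
    assert num_ctx <= NATIVE_CTX, tag
    share = num_ctx / NATIVE_CTX
    print(f"{tag}: {ctx_in_k(num_ctx)}K tokens ({share:.0%} of the native window)")
```

Under that reading, the 31B profile's 153600 tokens is 150K and the 4090 26B profile's 131072 tokens is 128K.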
