436 Downloads Updated 1 week ago
ollama run scorpion7slayer/gemma-4-12b-it-claude-4.6-4.8-opus
Updated 1 week ago
1 week ago
489d0a174a0f · 7.4GB ·
license: gemma base_model: google/gemma-4-12B-it library_name: gguf pipeline_tag: text-generation
No matter your GPU. No matter your RAM. If you’ve got ~4.5 GB of VRAM or unified memory free, you can run your own private, offline AI right now. 🚀 Tuned on Opus 4.6, 4.7 & 4.8 reasoning data, it delivers a major leap in reasoning power — whether you’re asking questions or writing code. 🧠💻 All local, all yours, no API, no cloud.
As of June 7, 2026, mainline llama.cpp just merged Gemma 4 MTP support — so the MTP draft model is now
live in the MTP/ folder.
Drop it next to any quant and generation gets noticeably faster with identical output (speculative decoding is
lossless) — just add a couple of flags. 👉 See ⚡ Speed it up with MTP below. 💚
| Quant | Size | Vibe |
|---|---|---|
| 🟢 Q2_K | 4.5 GB | tiniest — runs almost anywhere |
| 🔵 Q4_K_M | 6.87 GB | the sweet spot 👌 (recommended) |
| 🟣 Q6_K | 9.11 GB | near-lossless |
| ⚪ Q8_0 | 11.8 GB | basically full quality |
| (f16) | 22.2 GB | full precision (overkill for most) |
Rough estimates 🤓 (assumes q8_0 KV cache + ~1.5 GB overhead; use q4_0 KV cache for ≈2× more context!).
Max context is 131K. “—” = won’t fit, pick a smaller quant. ✂️
| Your VRAM / unified mem | 🟢 Q2_K (4.5G) | 🔵 Q4_K_M (6.87G) | 🟣 Q6_K (9.11G) | ⚪ Q8_0 (11.8G) |
|---|---|---|---|---|
| 8 GB | ~16K ctx | tight (~2–4K) | — | — |
| 12 GB | ~48K | ~30K | ~12K | — |
| 16 GB | ~80K | ~64K | ~44K | ~22K |
| 24 GB | 131K (max) 🎉 | ~128K | ~110K | ~88K |
| 32 GB | 131K | 131K | 131K | 131K |
💡 Apple Silicon / integrated GPUs with unified memory count too — same numbers, just slower than a dGPU. 💡 Low on room? Drop a quant or switch KV cache to
q4_0and your context roughly doubles.
New as of June 7, 2026! Gemma 4’s Multi-Token Prediction drafter lets the model guess a few tokens ahead and verify them in one shot — so you get more tokens/sec with byte-for-byte identical output. Pure speed, zero quality cost. 🪄
1. Grab the tiny draft from the MTP/ folder:
| Draft file | Size | Use it for |
|---|---|---|
⚪ gemma-4-12B-it-MTP-Q8_0.gguf |
0.44 GB | recommended — tiny + full speed |
…-F16.gguf / …-BF16.gguf |
0.82 GB | full-precision draft (overkill) |
💡 The draft is tiny — keep it Q8 or higher (over-quantizing a draft just lowers its hit rate). It pairs with any quant of the main model.
2. You need a fresh llama.cpp build — June 7 2026 (b9553) or newer. MTP was just merged, so older builds
can’t load the draft (unknown architecture: 'gemma4-assistant').
3. Run it exactly like below, just +3 flags (--model-draft, --spec-type, --n-gpu-layers-draft):
@echo off
cd /d C:\llama.cpp
llama-server.exe ^
-m C:\models\gemma4-opus48-Q4_K_M.gguf ^
--model-draft C:\models\MTP\gemma-4-12B-it-MTP-Q8_0.gguf ^
--spec-type draft-mtp --spec-draft-n-max 4 ^
--ctx-size 16384 --n-gpu-layers 99 --n-gpu-layers-draft 99 ^
--no-mmap -fa on ^
--temp 1.0 --top-p 0.95 --top-k 64 ^
--host 0.0.0.0 --port 18080
pause
Measured on a single RTX 5090 (Q4_K_M main + Q8 draft): ~1.3× faster at greedy and ~1.2× at the default thinking sampling — free, with no change to output. 🎈
🔧 Heads-up: this is the stock Gemma drafter (trained on base Gemma 4), so on this fine-tune the hit rate — and thus the speedup — is a little lower than on vanilla Gemma 4. A re-aligned draft could push it higher (maybe a future update). Either way: free speed, no downside. 💚
…-Q4_K_M.gguf) and llama-server from llama.cpp.
> ⚠️ Needs a recent llama.cpp (this is the gemma4_unified architecture — older builds won’t load it)..bat shown — tweak --port, --ctx-size to taste):@echo off
cd /d C:\llama.cpp
llama-server.exe ^
-m C:\models\gemma4-opus48-Q4_K_M.gguf ^
--ctx-size 16384 ^
--n-gpu-layers 99 ^
--no-mmap ^
-fa on ^
--cache-type-k q8_0 --cache-type-v q8_0 ^
--temp 1.0 --top-p 0.95 --top-k 64 ^
--host 0.0.0.0 --port 18080
pause
http://localhost:18080 and chat. 🎉 (Tip: bump --ctx-size per the table; use q4_0 KV for more.)Works in LM Studio, Jan, Ollama, etc. — just import the GGUF, pick your quant, go. 🐾
This model thinks in Gemma’s native thought channel. Keep enable_thinking=true (the default chat template
handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64.
google/gemma-4-12B-it. Subject to the
Gemma Terms of Use (derivatives must comply).angrygiraffe/claude-opus-4.6-4.7-reasoning-8.7k,
augmented with additional Opus 4.8-generated reasoning samples I curated and mixed in.