755 Downloads Updated 5 days ago
ollama run xentriom/gemma-4-12B-coder-fable5-composer2.5-v1
All credit goes to yuxinlu1, this is ported directly from:
https://huggingface.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
No matter your GPU. No matter your RAM. If you’ve got ~4.5 GB of VRAM or unified memory free, you can run your own private, offline coding assistant right now. 🚀 This is the v1 / code edition — distilled from real chain-of-thought so it thinks through a problem before writing the solution. 🧠💻 All local, all yours, no API, no cloud.
A focused fine-tune of Gemma 4 12B on verifiable Python coding data — every training example’s reasoning leads to code that actually passed its tests. The result reasons in the open (edge cases, complexity, approach) and then emits a clean, runnable solution. 💚
🚀🔥 IT’S HERE — v2 is OUT NOW! v2 has shipped — the GGUF quants are live and ready to run →
grab v2 here. 🎉
The full safetensors master (build / fine-tune on top) goes up tomorrow. v2 is agentic + coding focused —
the piece v1 was missing.
Here’s the result that got me most excited. When I saw v2’s tau2-bench telecom result — an agentic tool-use
benchmark where the model has to diagnose → fix → verify, exactly like real terminal/debugging work — I literally got
launched out of my chair (…okay, kidding 😄). The jump in actually solving the problem is wild:
| tau2-bench telecom · local, same harness, Q8_0 | score |
|---|---|
official gemma-4-12B-it (base) |
~15% |
| 🟢 v2 (this release) | ~55% |
The base model tends to give up early (hands the problem off to a human); v2 keeps going and works it the way a much bigger model would. Full benchmark details are in the v2 card now. 🔧
✅ safetensors master (this v1 model) is UP. Full-precision weights are live → yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1 — roll your own GGUF / MLX / AWQ quants or fine-tune straight from the master. 🎉
A community member spotted that this model was reporting only a 131K context window. That turned out to be
the well-known upstream Gemma 4 metadata bug — Google’s initial config.json shipped with
max_position_embeddings: 131072 instead of the real 262144 (256K), and that value got baked into a lot of
downstream finetunes and quants (including this one) before it was fixed upstream.
The weights were always fine — it was purely a metadata field. All GGUF quants have been re-patched to the
full 256K context (gemma4.context_length = 262144). Just re-download if you grabbed an earlier copy. 🙏
This is a distillation of two complementary chain-of-thought sources, both over verifiable Python coding tasks (algorithmic / function-level problems that come with deterministic tests):
The recipe: real CoT for the bulk of solid coverage, plus synthetic “second-attempt” CoT to patch the failures — both verified by execution before anything entered training. ✅
| Quant | Size | Vibe |
|---|---|---|
| 🟢 Q2_K | 4.5 GB | tiniest — runs almost anywhere |
| 🟡 Q3_K_M | 5.7 GB | great for 8 GB VRAM — much better than Q2 |
| 🔵 Q4_K_M | 6.87 GB | the sweet spot 👌 (recommended) |
| 🟣 Q6_K | 9.11 GB | near-lossless |
| ⚪ Q8_0 | 11.8 GB | basically full quality |
Rough estimates 🤓 (assumes q8_0 KV cache + ~1.5 GB overhead; use q4_0 KV cache for ≈2× more context!).
Max context is 256K. “—” = won’t fit, pick a smaller quant. ✂️
| Your VRAM / unified mem | 🟢 Q2_K (4.5G) | 🟡 Q3_K_M (5.7G) | 🔵 Q4_K_M (6.87G) | 🟣 Q6_K (9.11G) | ⚪ Q8_0 (11.8G) |
|---|---|---|---|---|---|
| 8 GB | ~16K ctx | ~10K | tight (~2–4K) | — | — |
| 12 GB | ~48K | ~38K | ~30K | ~12K | — |
| 16 GB | ~80K | ~72K | ~64K | ~44K | ~22K |
| 24 GB | ~200K | ~160K | ~128K | ~110K | ~88K |
| 32 GB | 256K (max) 🎉 | 256K | 256K | ~230K | ~190K |
💡 Apple Silicon / integrated GPUs with unified memory count too — same numbers, just slower than a dGPU. 💡 Low on room? Drop a quant or switch KV cache to
q4_0and your context roughly doubles.
…-Q4_K_M.gguf) and llama-server from llama.cpp.
> ⚠️ Needs a recent llama.cpp (this is the gemma4_unified architecture — older builds won’t load it)..bat shown — tweak --port, --ctx-size to taste):@echo off
cd /d C:\llama.cpp
llama-server.exe ^
-m C:\models\gemma4-coding-Q4_K_M.gguf ^
--ctx-size 16384 ^
--n-gpu-layers 99 ^
--no-mmap ^
-fa on ^
--cache-type-k q8_0 --cache-type-v q8_0 ^
--temp 1.0 --top-p 0.95 --top-k 64 ^
--host 0.0.0.0 --port 18080
pause
http://localhost:18080 and chat. 🎉 (Tip: bump --ctx-size per the table; use q4_0 KV for more.)Works in LM Studio, Jan, Ollama, etc. — just import the GGUF, pick your quant, go. 🐾
This model thinks in Gemma’s native thought channel before answering — exactly how it was trained. Keep
enable_thinking=true (the default chat template handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64.
For coding you can also go greedy (temp 0) for more deterministic solutions.
google/gemma-4-12B-it.