JetBrains/mellum2-thinking-q4_k

JetBrains/ mellum2-thinking-q4_k_m:latest

67 Downloads Updated 1 week ago

Low latency instruct LLM by JetBrains

tools thinking

mellum2-thinking-q4_k_m:latest ... /

license

67130ee26f11 · 2.3kB

Mellum2 Thinking — GGUF (Q4_K_M)

This repository contains a **GGUF Q4_K_M** quantization of

[`JetBrains/Mellum2-12B-A2.5B-Thinking`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking), ready to run with

[`llama.cpp`](https://github.com/ggml-org/llama.cpp), Ollama, LM Studio, and

other GGUF-compatible runtimes.

**This quantization (Q4_K_M):** 4-bit k-quant (medium). Strong quality/size trade-off (KLD ~0.052, 90% top-token agreement) — a good default.

| File | Size |

|---|---|

| `Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf` | 8.1 GB |

Mellum 2 Thinking is a Mixture-of-Experts reasoning model (64 experts, 8

activated per token, 131,072-token context) that emits its chain of thought

inside `<think>...</think>` blocks before the final answer. For the full model

description, evaluation results, and architecture details, see the original

model card: **[JetBrains/Mellum2-12B-A2.5B-Thinking](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking)**.

## Available quantizations

|---|---|---|---|---|

| [`BF16`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-BF16) | 16-bit, no quantization (reference) | 24.3 GB | — | — |

| [`Q8_0`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q8_0) | 8-bit, effectively lossless | 12.9 GB | 0.004 | 97.4% |

| [`Q6_K`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q6_K) | 6-bit k-quant, very high quality | 10.9 GB | 0.014 | 95.1% |

| **`Q4_K_M` (this repo)** | 4-bit k-quant, balanced (recommended) | 8.1 GB | 0.052 | 89.8% |

| [`MXFP4_MOE`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-MXFP4_MOE) | MXFP4 4-bit on MoE experts, smallest | 7.0 GB | 0.088 | 87.3% |

KL divergence and top-token agreement are measured against the BF16 logits on

Wikitext-2 (`n_ctx=512`); lower KLD / higher agreement means closer to the

unquantized model.

## Run with Ollama

```sh

ollama run hf.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M

```

## License

Released under the Apache 2.0 license.

---

*For the full model card, evaluation results, and architecture details, refer to

the original model: [JetBrains/Mellum2-12B-A2.5B-Thinking](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking)*