67 1 week ago

Low latency instruct LLM by JetBrains

tools thinking
67130ee26f11 · 2.3kB
Mellum2 Thinking — GGUF (Q4_K_M)
This repository contains a **GGUF Q4_K_M** quantization of
[`JetBrains/Mellum2-12B-A2.5B-Thinking`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking), ready to run with
[`llama.cpp`](https://github.com/ggml-org/llama.cpp), Ollama, LM Studio, and
other GGUF-compatible runtimes.
**This quantization (Q4_K_M):** 4-bit k-quant (medium). Strong quality/size trade-off (KLD ~0.052, 90% top-token agreement) — a good default.
| File | Size |
|---|---|
| `Mellum2-12B-A2.5B-Thinking-Q4_K_M.gguf` | 8.1 GB |
Mellum 2 Thinking is a Mixture-of-Experts reasoning model (64 experts, 8
activated per token, 131,072-token context) that emits its chain of thought
inside `<think>...</think>` blocks before the final answer. For the full model
description, evaluation results, and architecture details, see the original
model card: **[JetBrains/Mellum2-12B-A2.5B-Thinking](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking)**.
## Available quantizations
| Quantization | Description | Size | KLD vs BF16 ↓ | Top-token match ↑ |
|---|---|---|---|---|
| [`BF16`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-BF16) | 16-bit, no quantization (reference) | 24.3 GB | — | — |
| [`Q8_0`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q8_0) | 8-bit, effectively lossless | 12.9 GB | 0.004 | 97.4% |
| [`Q6_K`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q6_K) | 6-bit k-quant, very high quality | 10.9 GB | 0.014 | 95.1% |
| **`Q4_K_M` (this repo)** | 4-bit k-quant, balanced (recommended) | 8.1 GB | 0.052 | 89.8% |
| [`MXFP4_MOE`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-MXFP4_MOE) | MXFP4 4-bit on MoE experts, smallest | 7.0 GB | 0.088 | 87.3% |
KL divergence and top-token agreement are measured against the BF16 logits on
Wikitext-2 (`n_ctx=512`); lower KLD / higher agreement means closer to the
unquantized model.
## Run with Ollama
```sh
ollama run hf.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M
```
## License
Released under the Apache 2.0 license.
---
*For the full model card, evaluation results, and architecture details, refer to
the original model: [JetBrains/Mellum2-12B-A2.5B-Thinking](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking)*