70 1 week ago

Low-latency instruct LLM by JetBrains

tools thinking
ollama run JetBrains/mellum2-thinking-q4_k_m

Details


36a1b2f1712f · 8.1GB · mellum · 12.1B · Q4_K_M
{{- if .Messages }} {{- $lastUser := -1 }} {{- range $i, $m := .Messages }} {{- if eq $m.Role "user"
{ "num_ctx": 131072, "stop": [ "<|im_start|>", "<|im_end|>" ], "temp

Readme

Mellum2 Thinking — Q4_K_M

This repository contains a GGUF Q4_K_M quantization of JetBrains/Mellum2-12B-A2.5B-Thinking, ready to run with llama.cpp, Ollama, LM Studio, and other GGUF-compatible runtimes.

Mellum2 Thinking is a Mixture-of-Experts reasoning model (64 experts, 8 activated per token, 131,072-token context) that emits its chain of thought inside <think>...</think> blocks before the final answer. For the full model description, evaluation results, and architecture details, see the original model card: JetBrains/Mellum2-12B-A2.5B-Thinking.
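Because the reasoning arrives inline with the answer, client code usually needs to separate the two. The following is a minimal sketch, assuming the output contains at most one `<think>...</think>` block as described above; `split_thinking` is a hypothetical helper name, not part of any runtime.

```python
import re

# Assumption: the model emits at most one <think>...</think> block,
# placed before the final answer, as the description above states.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output.

    split_thinking is a hypothetical helper written for this example.
    """
    match = _THINK_RE.search(text)
    if match is None:
        # No reasoning block found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the block is the user-facing answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Some runtimes may strip or relocate the reasoning block before it reaches the client, so the no-match branch is worth keeping.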

Available quantizations

| Quantization | Description | Size | KLD vs BF16 ↓ | Top-token match ↑ |
|---|---|---|---|---|
| Q4_K_M (this repo) | 4-bit k-quant, balanced (recommended) | 8.1 GB | 0.052 | 89.8% |
| BF16 | 16-bit, no quantization (reference) | 24.3 GB | | |
| Q8_0 | 8-bit, effectively lossless | 12.9 GB | 0.004 | 97.4% |
| Q6_K | 6-bit k-quant, very high quality | 10.9 GB | 0.014 | 95.1% |
| MXFP4_MOE | MXFP4 4-bit on MoE experts, smallest | 7.0 GB | 0.088 | 87.3% |

KL divergence and top-token agreement are measured against the BF16 logits on Wikitext-2 (n_ctx=512); lower KLD / higher agreement means closer to the unquantized model.
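To make the two metrics concrete, here is a small sketch of how they can be computed from per-position logits. It illustrates the definitions only and is not the procedure used to produce the table: the function names and toy logits are made up, the reference distribution is assumed to be BF16, and KLD is reported in nats (the table does not state units).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def compare(ref_positions, test_positions):
    """Return (mean KLD, top-token match rate) over all positions."""
    n = len(ref_positions)
    pairs = list(zip(ref_positions, test_positions))
    mean_kld = sum(kl_divergence(r, t) for r, t in pairs) / n
    match_rate = sum(argmax(r) == argmax(t) for r, t in pairs) / n
    return mean_kld, match_rate


# Toy data (hypothetical): two positions over a three-token vocabulary.
bf16_logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quant_logits = [[1.9, 1.1, 0.0], [2.4, 0.6, 0.2]]  # second position flips the top token

mean_kld, match_rate = compare(bf16_logits, quant_logits)
print(f"mean KLD: {mean_kld:.4f} nats, top-token match: {match_rate:.0%}")
```

A real evaluation would average over every token position in the test set rather than two toy positions.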

Run with Ollama

ollama run hf.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M
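Once the model has been pulled, it can also be called programmatically. The sketch below uses Ollama's HTTP chat endpoint via the Python standard library. It assumes a local Ollama server on the default port 11434; `build_chat_request` and `chat` are hypothetical helper names, and the `num_ctx` value of 8192 is an illustrative choice, not a recommendation from the model authors.

```python
import json
import urllib.request

# Assumptions: local Ollama server on its default port, and the model
# name taken from the `ollama run` command above.
OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "hf.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M"


def build_chat_request(prompt, model=MODEL, num_ctx=8192):
    """Build a non-streaming chat request (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # illustrative context size
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["message"]["content"]
```

Calling `chat("Explain binary search in two sentences.")` requires a running server. Whether the returned text still contains the reasoning block may depend on the runtime version and its settings.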

License

Released under the Apache 2.0 license.

