Low-latency instruct LLM by JetBrains

tools
ollama run JetBrains/mellum2-instruct-q8_0

Applications

Claude Code: ollama launch claude --model JetBrains/mellum2-instruct-q8_0
Codex App: ollama launch codex-app --model JetBrains/mellum2-instruct-q8_0
OpenClaw: ollama launch openclaw --model JetBrains/mellum2-instruct-q8_0
Hermes Agent: ollama launch hermes --model JetBrains/mellum2-instruct-q8_0
Codex: ollama launch codex --model JetBrains/mellum2-instruct-q8_0
OpenCode: ollama launch opencode --model JetBrains/mellum2-instruct-q8_0

Readme

Mellum2 Instruct — Q8_0

This repository contains a GGUF Q8_0 quantization of JetBrains/Mellum2-12B-A2.5B-Instruct, ready to run with llama.cpp, Ollama, LM Studio, and other GGUF-compatible runtimes.

Mellum2 Instruct is a Mixture-of-Experts assistant model (64 experts, 8 activated per token, 131,072-token context) that answers directly, without an externalized chain of thought. For the full model description, evaluation results, and architecture details, see the original model card: JetBrains/Mellum2-12B-A2.5B-Instruct.
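To make "64 experts, 8 activated per token" concrete, here is a minimal sketch of generic top-k expert routing. It is an illustration only: the source does not describe Mellum2's actual router, so the scoring scheme (softmax over the selected experts' scores) and the name `route` are assumptions.

```python
import math
import random

NUM_EXPERTS = 64  # from the model description
TOP_K = 8         # experts activated per token, from the model description


def route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Assumption: normalization happens over the selected experts only.
    Real MoE routers differ in this detail.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Toy router scores for a single token (random, for illustration only).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(logits)

for expert_id, weight in selected:
    print(f"expert {expert_id:2d}  weight {weight:.3f}")
```

Only the selected experts run for that token, which is why a model of this kind touches a fraction of its total parameters per step.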

Available quantizations

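| Quantization     | Description                           | Size    | KLD vs BF16 ↓ | Top-token match ↑ |
|------------------|---------------------------------------|---------|---------------|-------------------|
| Q8_0 (this repo) | 8-bit, effectively lossless           | 12.9 GB | 0.016         | 95.2%             |
| BF16             | 16-bit, no quantization (reference)   | 24.3 GB | -             | -                 |
| Q6_K             | 6-bit k-quant, very high quality      | 10.9 GB | 0.038         | 92.9%             |
| Q4_K_M           | 4-bit k-quant, balanced (recommended) | 8.1 GB  | 0.106         | 87.2%             |
| MXFP4_MOE        | MXFP4 4-bit on MoE experts, smallest  | 7.0 GB  | 0.166         | 84.2%             |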

KL divergence (KLD) and top-token agreement are measured against the BF16 logits on Wikitext-2 (n_ctx=512). A lower KLD and a higher top-token agreement both mean the quantized model's output is closer to that of the unquantized model.
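The two metrics can be sketched in a few lines of pure Python. This is a hedged illustration, not the procedure used to produce the table: the source does not name the evaluation tool, and the sketch assumes the KLD is averaged over token positions and that "top-token match" is the share of positions where both models pick the same argmax token. The logits below are made-up toy values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for one token position."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(quant_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def compare(ref_positions, quant_positions):
    """Return (mean KLD, top-token match rate) over all positions."""
    pairs = list(zip(ref_positions, quant_positions))
    mean_kld = sum(kl_divergence(r, q) for r, q in pairs) / len(pairs)
    match = sum(argmax(r) == argmax(q) for r, q in pairs) / len(pairs)
    return mean_kld, match


# Toy logits: 2 token positions, vocabulary of 3 (illustrative values only).
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0]]
quant = [[1.9, 1.1, 0.1], [2.6, 2.5, 0.0]]

mean_kld, match = compare(ref, quant)
print(f"mean KLD = {mean_kld:.4f}, top-token match = {match:.1%}")
```

In the toy data the second position flips its top token, so the match rate drops to one half even though the first position is nearly unchanged.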

Run with Ollama

ollama create JetBrains/mellum2-instruct-q8_0 -f Modelfile
ollama run JetBrains/mellum2-instruct-q8_0
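Once the model has been created, it can also be queried programmatically. The following is a minimal sketch, assuming a local Ollama server listening on its default address (http://localhost:11434) and using its /api/generate endpoint with streaming disabled. The helper names `build_generate_request` and `generate` are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "JetBrains/mellum2-instruct-q8_0"


def build_generate_request(prompt, model=MODEL, url=OLLAMA_URL):
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(prompt, timeout=120):
    """Send the prompt and return the model's full text response."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


# Example call (requires a running server with the model available):
# print(generate("Write a Kotlin function that reverses a string."))
```

Setting "stream" to false returns a single JSON object, which keeps the example short; a streaming client would instead read one JSON object per line.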

License

Released under the Apache 2.0 license.
