JetBrains/mellum2-instruct-q8_0

16 Downloads Updated 1 week ago

Low-latency instruct LLM by JetBrains

tools

```sh
ollama run JetBrains/mellum2-instruct-q8_0
```

```sh
curl http://localhost:11434/api/chat \
  -d '{
    "model": "JetBrains/mellum2-instruct-q8_0",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

```python
from ollama import chat

response = chat(
    model='JetBrains/mellum2-instruct-q8_0',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)
```

```javascript
import ollama from 'ollama'

const response = await ollama.chat({
  model: 'JetBrains/mellum2-instruct-q8_0',
  messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)
```

Applications

| Application | Launch command |
|---|---|
| Claude Code | `ollama launch claude --model JetBrains/mellum2-instruct-q8_0` |
| Codex App | `ollama launch codex-app --model JetBrains/mellum2-instruct-q8_0` |
| OpenClaw | `ollama launch openclaw --model JetBrains/mellum2-instruct-q8_0` |
| Hermes Agent | `ollama launch hermes --model JetBrains/mellum2-instruct-q8_0` |
| Codex | `ollama launch codex --model JetBrains/mellum2-instruct-q8_0` |
| OpenCode | `ollama launch opencode --model JetBrains/mellum2-instruct-q8_0` |

Models

| Name | Size | Context | Input | Updated |
|---|---|---|---|---|
| `mellum2-instruct-q8_0:latest` | 13GB | 128K | Text | 1 week ago |

Readme

# Mellum2 Instruct — Q8_0

This repository contains a **GGUF Q8_0** quantization of
[`JetBrains/Mellum2-12B-A2.5B-Instruct`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct), ready to run with
[`llama.cpp`](https://github.com/ggml-org/llama.cpp), Ollama, LM Studio, and
other GGUF-compatible runtimes.

Mellum2 Instruct is a Mixture-of-Experts assistant model (64 experts, 8
activated per token, 131,072-token context) that answers directly, without an
externalized chain of thought. For the full model description, evaluation
results, and architecture details, see the original model card:
**[JetBrains/Mellum2-12B-A2.5B-Instruct](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct)**.
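
To make the "64 experts, 8 activated per token" description concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a generic illustration of the technique, not Mellum2's actual router: the helper name `route_token`, the random router scores, and the choice to softmax-normalize over only the selected experts are all assumptions made for this example.

```python
import math
import random

NUM_EXPERTS = 64  # from the model description
TOP_K = 8         # experts activated per token, from the model description


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and normalize their weights.

    Hypothetical helper for illustration; a real router may normalize differently.
    """
    # Indices of the top_k largest router scores, best first.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax over the selected scores only.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # made-up scores
routing = route_token(logits)

print(len(routing))                  # 8 experts are active for this token
print(f"{TOP_K / NUM_EXPERTS:.3f}")  # 0.125 of the experts run per token
```

Because only 8 of the 64 experts run for each token, per-token compute tracks the active parameter count rather than the total, which is what the `12B-A2.5B` part of the model name appears to indicate.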

## Available quantizations

| Quantization | Description | Size | KLD vs BF16 ↓ | Top-token match ↑ |
|---|---|---|---|---|
| **`Q8_0` (this repo)** | 8-bit, effectively lossless | 12.9 GB | 0.016 | 95.2% |
| [`BF16`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-BF16) | 16-bit, no quantization (reference) | 24.3 GB | — | — |
| [`Q6_K`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q6_K) | 6-bit k-quant, very high quality | 10.9 GB | 0.038 | 92.9% |
| [`Q4_K_M`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q4_K_M) | 4-bit k-quant, balanced (recommended) | 8.1 GB | 0.106 | 87.2% |
| [`MXFP4_MOE`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-MXFP4_MOE) | MXFP4 4-bit on MoE experts, smallest | 7.0 GB | 0.166 | 84.2% |

KL divergence and top-token agreement are measured against the BF16 logits on
Wikitext-2 (`n_ctx=512`); lower KLD / higher agreement means closer to the
unquantized model.
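
As a rough illustration of what the two columns measure, the sketch below computes a mean KL divergence and a top-token match rate from two sets of logits. The logits are made up, the helper names are hypothetical, and the natural logarithm is an assumption, since the README does not state the tool or log base used for the published numbers.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def top_token_match(ref_rows, quant_rows):
    """Fraction of positions where both models rank the same token first."""
    hits = sum(argmax(r) == argmax(q) for r, q in zip(ref_rows, quant_rows))
    return hits / len(ref_rows)


# Made-up logits over a 4-token vocabulary at 3 positions.
ref = [[2.0, 1.0, 0.1, -1.0], [0.5, 2.5, 0.0, 0.0], [1.0, 1.1, 3.0, 0.2]]
quant = [[1.9, 1.1, 0.1, -1.0], [0.5, 2.4, 0.1, 0.0], [1.0, 3.1, 3.0, 0.2]]

mean_kld = sum(kl_divergence(r, q) for r, q in zip(ref, quant)) / len(ref)
match = top_token_match(ref, quant)
print(f"mean KLD: {mean_kld:.4f}")
print(f"top-token match: {match:.1%}")  # 66.7%: the third position disagrees
```

In this toy data the quantized logits pick a different top token at one of three positions, so the match rate drops even though most of the probability mass is still close to the reference.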

## Run with Ollama

```sh
ollama create JetBrains/mellum2-instruct-q8_0 -f Modelfile
ollama run JetBrains/mellum2-instruct-q8_0
```
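
Once the model is running, it can also be queried over HTTP without any client library. The sketch below uses only the Python standard library and assumes the Ollama server is listening on its default local address and that `/api/chat` streams newline-delimited JSON chunks, each carrying a `message.content` fragment and a `done` flag. The helper names are hypothetical, and the demonstration at the bottom runs on hand-written sample chunks rather than a live server.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumed default address
MODEL = "JetBrains/mellum2-instruct-q8_0"


def build_request(prompt, url=OLLAMA_CHAT_URL, model=MODEL):
    """Build a POST request carrying a single user message."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})


def join_stream(lines):
    """Concatenate message.content from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def chat_once(prompt):
    """Send one prompt to a running Ollama server and return the full reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return join_stream(resp)


# Offline demonstration with sample chunks shaped like a streamed reply.
sample = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "lo!"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print(join_stream(sample))  # Hello!
```

With a server running, `print(chat_once("Hello!"))` would send the same request as the other examples and print the assembled reply.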

## License

Released under the Apache 2.0 license.

---

*For the full model card, evaluation results, and architecture details, refer to
the original model: [JetBrains/Mellum2-12B-A2.5B-Instruct](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct).*
