JetBrains/mellum2-instruct-q4_k_m:latest

55 Downloads · Updated 1 week ago

Low-latency instruct LLM by JetBrains

tools

```sh
ollama run JetBrains/mellum2-instruct-q4_k_m
```

```sh
curl http://localhost:11434/api/chat \
  -d '{
    "model": "JetBrains/mellum2-instruct-q4_k_m",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

```python
from ollama import chat

response = chat(
    model='JetBrains/mellum2-instruct-q4_k_m',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)
```

```javascript
import ollama from 'ollama'

const response = await ollama.chat({
  model: 'JetBrains/mellum2-instruct-q4_k_m',
  messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)
```

Details

dc1c2dd93ce5 · 8.1GB

model

arch mellum · parameters 12.1B · quantization Q4_K_M

8.1GB

template

{{- if .Messages }} {{- $lastUser := -1 }} {{- range $i, $m := .Messages }} {{- if eq $m.Role "user"

1.6kB

license

Released under the Apache 2.0 license.

38B

params

{ "num_ctx": 131072, "stop": [ "<|im_start|>", "<|im_end|>" ], "temp

118B

Readme

# Mellum2 Instruct — Q4_K_M

This repository contains a **GGUF Q4_K_M** quantization of
[`JetBrains/Mellum2-12B-A2.5B-Instruct`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct), ready to run with
[`llama.cpp`](https://github.com/ggml-org/llama.cpp), Ollama, LM Studio, and
other GGUF-compatible runtimes.

Mellum2 Instruct is a Mixture-of-Experts assistant model (64 experts, 8
activated per token, 131,072-token context) that answers directly, without an
externalized chain of thought. For the full model description, evaluation
results, and architecture details, see the original model card:
**[JetBrains/Mellum2-12B-A2.5B-Instruct](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct)**.
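
To make "64 experts, 8 activated per token" concrete, here is a minimal sketch of generic top-k softmax routing. It is an illustration only: this README does not describe the actual Mellum2 router, so the function `route_token`, the random gate logits, and the renormalization over the chosen experts are all assumptions.

```python
import math
import random

NUM_EXPERTS = 64  # expert count stated in the model description
TOP_K = 8         # experts activated per token, also from the description


def route_token(gate_logits, top_k=TOP_K):
    """Return {expert_index: weight} for the top_k experts of one token.

    Generic top-k softmax routing (hypothetical helper); the real
    Mellum2 router may differ.
    """
    # Indices of the top_k largest gate logits.
    chosen = sorted(
        range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True
    )[:top_k]
    # Softmax restricted to the chosen experts; subtract the max for stability.
    peak = max(gate_logits[i] for i in chosen)
    exps = {i: math.exp(gate_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Toy gate logits for a single token (random, not real model output).
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route_token(logits)

print(len(weights), "of", NUM_EXPERTS, "experts active")
print("active fraction of experts:", TOP_K / NUM_EXPERTS)
```

Only the selected experts run for a given token, which is why the model name distinguishes the total parameter count (12B) from the much smaller activated count (A2.5B).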

## Available quantizations

| Quantization | Description | Size | KLD vs BF16 ↓ | Top-token match ↑ |
|---|---|---|---|---|
| **`Q4_K_M` (this repo)** | 4-bit k-quant, balanced (recommended) | 8.1 GB | 0.106 | 87.2% |
| [`BF16`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-BF16) | 16-bit, no quantization (reference) | 24.3 GB | — | — |
| [`Q8_0`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q8_0) | 8-bit, effectively lossless | 12.9 GB | 0.016 | 95.2% |
| [`Q6_K`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-Q6_K) | 6-bit k-quant, very high quality | 10.9 GB | 0.038 | 92.9% |
| [`MXFP4_MOE`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct-GGUF-MXFP4_MOE) | MXFP4 4-bit on MoE experts, smallest | 7.0 GB | 0.166 | 84.2% |

KL divergence and top-token agreement are measured against the BF16 logits on
Wikitext-2 (`n_ctx=512`); lower KLD / higher agreement means closer to the
unquantized model.
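
As a rough illustration of what the two table columns measure, the sketch below computes a mean KL divergence and a top-token match rate from per-position logits. The toy logits, the helper names, the direction KL(reference || quantized), and the use of natural logarithms are assumptions made for the example; the table above was produced on Wikitext-2 with real model outputs, not with this code.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (max-subtracted for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for one token position."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def top_token_match(ref_logits, quant_logits):
    """True when both models rank the same token first."""
    return ref_logits.index(max(ref_logits)) == quant_logits.index(max(quant_logits))


def compare(ref_positions, quant_positions):
    """Average both metrics over all token positions."""
    n = len(ref_positions)
    pairs = list(zip(ref_positions, quant_positions))
    mean_kld = sum(kl_divergence(r, q) for r, q in pairs) / n
    match_rate = sum(top_token_match(r, q) for r, q in pairs) / n
    return mean_kld, match_rate


# Toy data: 3 token positions over a 4-token vocabulary (made-up numbers).
ref = [[2.0, 1.0, 0.1, -1.0], [0.5, 2.5, 0.0, 0.3], [1.0, 1.1, 0.9, 0.2]]
quant = [[1.9, 1.1, 0.0, -0.8], [0.4, 2.4, 0.2, 0.3], [1.2, 1.0, 0.9, 0.1]]

mean_kld, match_rate = compare(ref, quant)
print(f"mean KLD: {mean_kld:.4f}  top-token match: {match_rate:.1%}")
```

A quantization that reproduced the reference logits exactly would score a KLD of 0 and a 100% match, which is the sense in which lower KLD and higher agreement mean "closer to the unquantized model".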

## Run with Ollama

```sh
ollama create JetBrains/mellum2-instruct-q4_k_m -f Modelfile
ollama run JetBrains/mellum2-instruct-q4_k_m
```
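
The `ollama create` command above expects a `Modelfile` in the working directory. The following sketch renders a minimal one using the `FROM` and `PARAMETER` directives, with the context size and stop tokens shown in this page's params blob. The GGUF filename and the `render_modelfile` helper are hypothetical, and the sketch deliberately omits the chat template and the remaining parameters, which are truncated on this page.

```python
def render_modelfile(gguf_path, num_ctx, stop_tokens):
    """Build the text of a minimal Ollama Modelfile (hypothetical helper)."""
    lines = [
        f"FROM {gguf_path}",              # local GGUF weights to import
        f"PARAMETER num_ctx {num_ctx}",   # context window in tokens
    ]
    # One PARAMETER line per stop sequence.
    lines += [f'PARAMETER stop "{tok}"' for tok in stop_tokens]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "./mellum2-instruct-q4_k_m.gguf",  # assumed filename; use your actual GGUF path
    131072,                            # num_ctx from the params blob
    ["<|im_start|>", "<|im_end|>"],    # stop tokens from the params blob
)
print(modelfile_text)
```

Saving the printed text as `Modelfile` next to the GGUF file gives `ollama create` something to build from; a full Modelfile would also carry the model's chat template.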

## License

Released under the Apache 2.0 license.

---

*For the full model card, evaluation results, and architecture details, refer to
the original model: [JetBrains/Mellum2-12B-A2.5B-Instruct](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Instruct).*
