JetBrains/mellum2-thinking-q4_k_m:latest

70 Downloads Updated 1 week ago

Low-latency instruct LLM by JetBrains

tools thinking

ollama run JetBrains/mellum2-thinking-q4_k_m

curl http://localhost:11434/api/chat \
  -d '{
    "model": "JetBrains/mellum2-thinking-q4_k_m",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

from ollama import chat

response = chat(
    model='JetBrains/mellum2-thinking-q4_k_m',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'JetBrains/mellum2-thinking-q4_k_m',
  messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)

Details

36a1b2f1712f · 8.1GB · Updated 1 week ago

model

arch mellum · parameters 12.1B · quantization Q4_K_M

8.1GB

template

{{- if .Messages }} {{- $lastUser := -1 }} {{- range $i, $m := .Messages }} {{- if eq $m.Role "user"

1.7kB

license

Mellum2 Thinking — GGUF (Q4_K_M) This repository contains a **GGUF Q4_K_M** quantization of [`JetB

2.3kB

params

{ "num_ctx": 131072, "stop": [ "<|im_start|>", "<|im_end|>" ], "temp

118B

Readme


# Mellum2 Thinking — Q4_K_M

This repository contains a **GGUF Q4_K_M** quantization of
[`JetBrains/Mellum2-12B-A2.5B-Thinking`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking), ready to run with
[`llama.cpp`](https://github.com/ggml-org/llama.cpp), Ollama, LM Studio, and
other GGUF-compatible runtimes.

Mellum2 Thinking is a Mixture-of-Experts reasoning model (64 experts, 8
activated per token, 131,072-token context) that emits its chain of thought
inside `<think>...</think>` blocks before the final answer. For the full model
description, evaluation results, and architecture details, see the original
model card: **[JetBrains/Mellum2-12B-A2.5B-Thinking](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking)**.
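Because the reasoning arrives inline in the generated text, client code often needs to separate it from the final answer. The following is a minimal sketch, assuming the raw output contains zero or more complete `<think>...</think>` blocks; `split_reasoning` is a hypothetical helper name, not part of any Mellum2 or Ollama API. Some runtimes may already return the reasoning in a separate field, in which case this step is unnecessary.

```python
import re

# Matches one complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    Hypothetical helper for illustration only.
    """
    # Collect the contents of every think block, in order of appearance.
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    # Whatever remains after removing the blocks is treated as the answer.
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

An unclosed `<think>` tag (for example, when generation is cut off by a token limit) is not matched by this pattern, so the truncated reasoning would stay in the answer string; handle that case separately if it matters for your application.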

## Available quantizations

| Quantization | Description | Size | KLD vs BF16 ↓ | Top-token match ↑ |
|---|---|---|---|---|
| **`Q4_K_M` (this repo)** | 4-bit k-quant, balanced (recommended) | 8.1 GB | 0.052 | 89.8% |
| [`BF16`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-BF16) | 16-bit, no quantization (reference) | 24.3 GB | — | — |
| [`Q8_0`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q8_0) | 8-bit, effectively lossless | 12.9 GB | 0.004 | 97.4% |
| [`Q6_K`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q6_K) | 6-bit k-quant, very high quality | 10.9 GB | 0.014 | 95.1% |
| [`MXFP4_MOE`](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-MXFP4_MOE) | MXFP4 4-bit on MoE experts, smallest | 7.0 GB | 0.088 | 87.3% |

KL divergence and top-token agreement are measured against the BF16 logits on
Wikitext-2 (`n_ctx=512`); lower KLD / higher agreement means closer to the
unquantized model.
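To make the two metrics concrete, here is a small sketch of how they can be computed from per-position logits. It assumes the KL divergence is taken with the BF16 distribution as the reference, KL(P_ref || P_quant), averaged over token positions, and that top-token match is the fraction of positions where both models pick the same argmax. The function names and the toy logits are illustrative assumptions; this is not necessarily the exact procedure used to produce the table.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def compare(ref_positions, quant_positions):
    """Return (mean KLD, top-token match fraction) over all positions."""
    klds = [kl_divergence(r, q) for r, q in zip(ref_positions, quant_positions)]
    matches = [argmax(r) == argmax(q) for r, q in zip(ref_positions, quant_positions)]
    return sum(klds) / len(klds), sum(matches) / len(matches)


# Toy logits for two positions over a three-token vocabulary (made-up values).
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
quant = [[1.9, 1.1, 0.1], [2.0, 1.9, 0.3]]

mean_kld, top_match = compare(ref, quant)
print(f"mean KLD: {mean_kld:.4f}, top-token match: {top_match:.0%}")
```

In the toy data the two models agree on the top token at the first position and disagree at the second, so the match rate is 50%.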

## Run with Ollama

```sh
ollama run hf.co/JetBrains/Mellum2-12B-A2.5B-Thinking-GGUF-Q4_K_M
```
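Once the model is available locally, it can also be called over Ollama's HTTP API. The sketch below builds a non-streaming `/api/chat` request using only the Python standard library. It makes several assumptions that you should check against your setup: the server listens on the default `http://localhost:11434`, the model tag is `JetBrains/mellum2-thinking-q4_k_m`, your Ollama version accepts the `think` request field, and `num_ctx=8192` is an arbitrary smaller context chosen here to reduce memory use. `build_chat_request` and `send_chat` are hypothetical helper names.

```python
import json
import urllib.request

# Assumed default local Ollama endpoint; adjust if your server runs elsewhere.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_request(prompt, model="JetBrains/mellum2-thinking-q4_k_m", num_ctx=8192):
    """Build a non-streaming chat payload (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,                  # return one JSON object instead of a stream
        "think": True,                    # assumption: server supports the think field
        "options": {"num_ctx": num_ctx},  # assumption: smaller context to save memory
    }


def send_chat(payload, url=OLLAMA_CHAT_URL):
    """POST the payload to a running Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Inspect the payload without contacting the server.
print(json.dumps(build_chat_request("Hello!"), indent=2))
```

With a server running, `send_chat(build_chat_request("Hello!"))` should return a dictionary whose `message` entry holds the reply text under `content`; whether the reasoning is returned separately or inline depends on the runtime version.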

## License

Released under the Apache 2.0 license.

---

*For the full model card, evaluation results, and architecture details, refer to
the original model: [JetBrains/Mellum2-12B-A2.5B-Thinking](https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking).*
