A 72B parameter coding model optimized for software engineering tasks, based on the Qwen2.5-72B architecture.
KAT-Dev-72B-Exp is a state-of-the-art coding model created by Kuaishou that achieves 74.6% accuracy on SWE-Bench Verified, making it one of the most capable open-source coding models available. It excels at code generation, debugging, and code review.
Four quantized versions are available, offering different trade-offs between quality and resource requirements:
| Variant | Size | Bits per Weight | Best For |
|---|---|---|---|
| iq4_xs | 39 GB | 4.25 bpw | Maximum quality, minimal degradation |
| iq3_m | 35 GB | 3.66 bpw | High quality, good balance |
| iq2_m | 29 GB | 2.70 bpw | Balanced compression |
| iq2_xxs | 25 GB | 2.06 bpw | Maximum compression, lowest memory use |
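As a rough cross-check, the listed sizes follow from bits-per-weight times parameter count; the actual GGUF files run somewhat larger because some tensors (embeddings, output head) are kept at higher precision. A quick sketch:

```python
# Rough weight-only size estimate: 72B parameters at the listed bpw.
PARAMS = 72e9

for variant, bpw in [("iq4_xs", 4.25), ("iq3_m", 3.66),
                     ("iq2_m", 2.70), ("iq2_xxs", 2.06)]:
    gb = PARAMS * bpw / 8 / 1e9  # bits -> bytes -> gigabytes
    print(f"{variant}: ~{gb:.1f} GB (weights only)")
```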
To get started, pull your preferred quantization:

```bash
# Choose your preferred quantization
ollama pull richardyoung/kat-dev-72b:iq4_xs    # Best quality
ollama pull richardyoung/kat-dev-72b:iq3_m     # Recommended
ollama pull richardyoung/kat-dev-72b:iq2_m     # Lower memory
ollama pull richardyoung/kat-dev-72b:iq2_xxs   # Minimum memory
```

Then start an interactive session:

```bash
ollama run richardyoung/kat-dev-72b:iq3_m
```
Code Generation:

```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Write a Python function to implement binary search"
```

Debugging:

```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Debug this code: [paste your code]"
```

Code Review:

```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Review and suggest improvements for: [code]"
```
All variants use optimized parameters for coding tasks:
- Temperature: 0.6 (balanced creativity and precision)
- Top-p: 0.9 (nucleus sampling)
- Top-k: 40
- Context length: 8192 tokens
- Chat template: Qwen-style (`<|im_start|>` / `<|im_end|>`)
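These defaults are baked into each variant, but Ollama lets you override them per request. A minimal sketch using the local REST API (the prompt and values here are illustrative, not recommendations):

```python
import requests

# Override the baked-in sampling parameters for a single request.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "richardyoung/kat-dev-72b:iq3_m",
        "prompt": "Explain the two-pointer technique in one paragraph.",
        "stream": False,
        "options": {
            "temperature": 0.2,  # more deterministic than the 0.6 default
            "top_p": 0.9,
            "top_k": 40,
            "num_ctx": 8192,     # context window in tokens
        },
    },
)
print(response.json()["response"])
```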
Approximate VRAM/RAM needed for inference:
| Variant | Minimum VRAM | Recommended VRAM |
|---|---|---|
| iq4_xs | 40 GB | 48 GB |
| iq3_m | 35 GB | 40 GB |
| iq2_m | 30 GB | 35 GB |
| iq2_xxs | 26 GB | 30 GB |
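If you are not sure how much VRAM you have, here is a quick check for NVIDIA hardware (assumes `nvidia-smi` is on your PATH; other vendors need their own tools):

```python
import subprocess

# Query total memory per GPU via nvidia-smi (values are reported in MiB).
out = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
    text=True,
)
for i, line in enumerate(out.strip().splitlines()):
    print(f"GPU {i}: {int(line) / 1024:.1f} GiB total VRAM")
```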
Quantization: IQ (Importance-Quantized) methods from llama.cpp:
- Preserves important weights with higher precision
- Optimizes less critical weights for size reduction
- Maintains model quality while reducing memory footprint
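As a conceptual illustration only (this toy sketch is not llama.cpp's actual algorithm), the idea is to spend precision where it matters most:

```python
import numpy as np

# Toy importance-aware quantization: weights whose activations matter
# more get a finer quantization grid than the rest.
rng = np.random.default_rng(0)
weights = rng.normal(size=1000)
importance = np.abs(rng.normal(size=1000))  # stand-in for activation statistics

def quantize(w, bits):
    # Uniform symmetric quantizer with 2**(bits-1)-1 positive levels.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

# Keep the top 10% most important weights at 4 bits, the rest at 2 bits.
cut = np.quantile(importance, 0.9)
dequant = np.where(importance >= cut, quantize(weights, 4), quantize(weights, 2))
print(f"mean abs error: {np.abs(weights - dequant).mean():.4f}")
```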
Original Model: Kwaipilot/KAT-Dev-72B-Exp
Quantizations: mradermacher on HuggingFace
iq4_xs: Choose if you
- Have 40GB+ VRAM available
- Need maximum quality for production use
- Are working on critical or complex projects

iq3_m (recommended): Choose if you
- Have 35-40GB VRAM available
- Want the best quality-to-size ratio
- Need reliable performance for most tasks

iq2_m: Choose if you
- Have 30-35GB VRAM available
- Can tolerate slight quality reduction
- Need to fit the model in limited memory

iq2_xxs: Choose if you
- Have 26-30GB VRAM available
- Prioritize memory efficiency
- Need quick prototyping or testing
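To automate that decision, here is a hypothetical helper that applies the guidance above (the thresholds mirror the Minimum VRAM column; this is not an official tool):

```python
# Hypothetical helper: pick the largest quantization that fits.
VARIANTS = [  # (tag, minimum VRAM in GB), best quality first
    ("iq4_xs", 40),
    ("iq3_m", 35),
    ("iq2_m", 30),
    ("iq2_xxs", 26),
]

def pick_variant(vram_gb: float) -> str | None:
    for tag, min_vram in VARIANTS:
        if vram_gb >= min_vram:
            return f"richardyoung/kat-dev-72b:{tag}"
    return None  # not enough memory for any variant

print(pick_variant(36))  # richardyoung/kat-dev-72b:iq3_m
```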
You can also call the model through Ollama's REST API from Python:

```python
import requests

def query_kat_dev(prompt, model="richardyoung/kat-dev-72b:iq3_m"):
    """Send a prompt to the local Ollama server and return the full reply."""
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    response.raise_for_status()
    return response.json()["response"]

# Example
code = query_kat_dev("Write a function to reverse a linked list in Python")
print(code)
```
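For long generations you may prefer streaming: with `"stream": true` the same endpoint returns newline-delimited JSON chunks. A sketch under the same assumptions as above:

```python
import json
import requests

# Stream tokens as they are generated; each line is a JSON object with a
# partial "response" field and a final "done" marker.
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "richardyoung/kat-dev-72b:iq3_m",
          "prompt": "Write a Python generator that yields primes.",
          "stream": True},
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break
```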
This model inherits the license from the original KAT-Dev-72B-Exp model. Please refer to the original model page for licensing details.
If you use this model in your research or applications, please cite:
```bibtex
@misc{kat-dev-72b-2025,
  author = {Kuaishou Technology},
  title  = {KAT-Dev-72B: A High-Performance Coding Model},
  year   = {2025},
  url    = {https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp}
}
```
For issues or questions:
- Ollama models: https://ollama.com/richardyoung/kat-dev-72b
- Original model: https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp
Note: This is an unofficial distribution. The model is quantized from the original KAT-Dev-72B-Exp for easier deployment via Ollama.