A 72B parameter coding model optimized for software engineering tasks, based on the Qwen2.5-72B architecture.

KAT-Dev-72B - Advanced Coding Assistant

Overview

KAT-Dev-72B-Exp is a state-of-the-art coding model created by Kuaishou that achieves 74.6% accuracy on SWE-Bench Verified, making it one of the most capable open-source coding models available. This model excels at:

  • Code generation and completion
  • Debugging and error analysis
  • Code refactoring and optimization
  • Multi-language programming support
  • Software engineering problem-solving

Model Variants

Four quantized versions are available, offering different trade-offs between quality and resource requirements:

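| Variant | Size  | Bits per Weight | Best For                             |
|---------|-------|-----------------|--------------------------------------|
| iq4_xs  | 39 GB | 4.25 bpw        | Maximum quality, minimal degradation |
| iq3_m   | 35 GB | 3.66 bpw        | High quality, good balance           |
| iq2_m   | 29 GB | 2.7 bpw         | Balanced compression                 |
| iq2_xxs | 25 GB | 2.06 bpw        | Maximum compression, minimal memory  |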
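
As a rough sanity check on the sizes above, weight storage can be estimated as parameter count times bits per weight. The sketch below assumes about 72.7 billion parameters (an assumption, not a figure from the table) and treats the result as a lower bound: published files may come out larger, for example when some tensors are stored at higher precision than the nominal bits-per-weight figure.

```python
# Rough lower-bound estimate of quantized weight size from bits per weight.
# PARAM_COUNT is an assumption (about 72.7 billion parameters).
PARAM_COUNT = 72.7e9

# Nominal bits-per-weight values copied from the variants table.
BITS_PER_WEIGHT = {
    "iq4_xs": 4.25,
    "iq3_m": 3.66,
    "iq2_m": 2.7,
    "iq2_xxs": 2.06,
}


def estimated_size_gb(bpw: float, params: float = PARAM_COUNT) -> float:
    """Convert params * bpw bits into decimal gigabytes."""
    return params * bpw / 8 / 1e9


for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{estimated_size_gb(bpw):.1f} GB")
```

The gap between this estimate and the listed size tends to grow for the smaller variants, so the table remains the better guide for download and memory planning.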

Quick Start

Pull the model

# Choose your preferred quantization
ollama pull richardyoung/kat-dev-72b:iq4_xs   # Best quality
ollama pull richardyoung/kat-dev-72b:iq3_m    # Recommended
ollama pull richardyoung/kat-dev-72b:iq2_m    # Lower memory
ollama pull richardyoung/kat-dev-72b:iq2_xxs  # Minimum memory

Run the model

ollama run richardyoung/kat-dev-72b:iq3_m

Example Usage

Code Generation:

ollama run richardyoung/kat-dev-72b:iq3_m "Write a Python function to implement binary search"

Debugging:

ollama run richardyoung/kat-dev-72b:iq3_m "Debug this code: [paste your code]"

Code Review:

ollama run richardyoung/kat-dev-72b:iq3_m "Review and suggest improvements for: [code]"
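
Instead of pasting code into the prompt by hand, the same `ollama run` invocation can be scripted. The following is a minimal sketch, assuming the `ollama` CLI is on the PATH and the model has already been pulled; the helper names `build_debug_command` and `debug_file` are hypothetical and not part of Ollama.

```python
import subprocess
from pathlib import Path

MODEL = "richardyoung/kat-dev-72b:iq3_m"


def build_debug_command(source_code, model=MODEL):
    """Build the argument list for `ollama run` with a debugging prompt."""
    prompt = f"Debug this code:\n\n{source_code}"
    return ["ollama", "run", model, prompt]


def debug_file(path):
    """Read a source file and ask the model to debug it (hypothetical helper)."""
    command = build_debug_command(Path(path).read_text(encoding="utf-8"))
    # check=True raises CalledProcessError if the CLI exits with an error.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `debug_file("my_script.py")` would then return the model's reply as a string. Passing the prompt as a single list element avoids shell quoting problems with code that contains quotes.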

Model Configuration

All variants use optimized parameters for coding tasks:

  • Temperature: 0.6 (balanced creativity and precision)
  • Top-p: 0.9 (nucleus sampling)
  • Top-k: 40
  • Context length: 8192 tokens
  • Chat template: Qwen-style (<|im_start|> / <|im_end|>)
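
These defaults can also be set or overridden per request through the `options` field of Ollama's `/api/generate` endpoint. The sketch below uses only the standard library and assumes a local Ollama server on the default port; `build_payload` and `generate` are illustrative names, not part of any library.

```python
import json
import urllib.request

# Sampling settings copied from the configuration list above.
DEFAULT_OPTIONS = {"temperature": 0.6, "top_p": 0.9, "top_k": 40, "num_ctx": 8192}


def build_payload(prompt, model="richardyoung/kat-dev-72b:iq3_m", **overrides):
    """Build a non-streaming /api/generate body with explicit options."""
    options = {**DEFAULT_OPTIONS, **overrides}
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def generate(prompt, host="http://localhost:11434", **overrides):
    """Send the payload to a local Ollama server (assumed to be running)."""
    body = json.dumps(build_payload(prompt, **overrides)).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read())["response"]
```

For example, `generate("Refactor this loop", temperature=0.2)` would request more deterministic output while keeping the other defaults.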

Performance

  • SWE-Bench Verified: 74.6% accuracy
  • Architecture: Qwen2.5-72B base
  • Training: Optimized for software engineering tasks
  • Languages: Strong support for Python, JavaScript, TypeScript, Java, C++, Go, Rust, and more

Memory Requirements

Approximate VRAM/RAM needed for inference:

| Variant | Minimum VRAM | Recommended VRAM |
|---------|--------------|------------------|
| iq4_xs  | 40 GB        | 48 GB            |
| iq3_m   | 35 GB        | 40 GB            |
| iq2_m   | 30 GB        | 35 GB            |
| iq2_xxs | 26 GB        | 30 GB            |

Use Cases

Software Development

  • Generate boilerplate code and templates
  • Implement algorithms and data structures
  • Create unit tests and test cases
  • Write documentation and comments

Code Analysis

  • Debug and fix errors
  • Optimize performance bottlenecks
  • Refactor legacy code
  • Explain complex code sections

Learning & Education

  • Understand programming concepts
  • Learn new languages and frameworks
  • Practice coding problems
  • Get detailed explanations of code

Technical Details

Quantization: IQ ("i-quant") methods from llama.cpp

  • Preserves important weights with higher precision
  • Optimizes less critical weights for size reduction
  • Maintains model quality while reducing memory footprint

Original Model: Kwaipilot/KAT-Dev-72B-Exp

Quantizations: mradermacher on HuggingFace

Choosing the Right Variant

iq4_xs - Choose if you:

  • Have 40GB+ VRAM available
  • Need maximum quality for production use
  • Are working on critical or complex projects

iq3_m (Recommended) - Choose if you:

  • Have 35-40GB VRAM available
  • Want the best quality-to-size ratio
  • Need reliable performance for most tasks

iq2_m - Choose if you:

  • Have 30-35GB VRAM available
  • Can tolerate slight quality reduction
  • Need to fit the model in limited memory

iq2_xxs - Choose if you:

  • Have 26-30GB VRAM available
  • Prioritize memory efficiency
  • Need quick prototyping or testing
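
The guidance above amounts to picking the highest-quality variant whose minimum VRAM fits the available memory. A minimal sketch of that rule follows, using the minimum-VRAM figures from the Memory Requirements section; `choose_variant` is a hypothetical helper, and the thresholds are approximate.

```python
# Minimum VRAM figures (GB) from the Memory Requirements section,
# ordered from highest to lowest quality.
MIN_VRAM_GB = [("iq4_xs", 40), ("iq3_m", 35), ("iq2_m", 30), ("iq2_xxs", 26)]


def choose_variant(available_gb):
    """Return the highest-quality variant that fits, or None if none does."""
    for name, min_gb in MIN_VRAM_GB:
        if available_gb >= min_gb:
            return name
    return None


for vram in (48, 36, 24):
    print(vram, "GB ->", choose_variant(vram))
```

This only considers memory. Latency requirements or a preference for the recommended iq3_m balance may justify a smaller variant than the one returned.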

API Usage

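import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def query_kat_dev(prompt, model="richardyoung/kat-dev-72b:iq3_m", timeout=600):
    """Send a single non-streaming prompt to a local Ollama server."""
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=timeout,  # large models can take a while to load and respond
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError
    return response.json()["response"]

# Example
code = query_kat_dev("Write a function to reverse a linked list in Python")
print(code)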
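
For long generations it can be more convenient to stream tokens as they arrive. With `"stream": True`, Ollama's `/api/generate` endpoint returns newline-delimited JSON objects, each carrying a `response` fragment and a `done` flag. The sketch below uses only the standard library; `iter_response_text` and `stream_kat_dev` are illustrative names, and a local server on the default port is assumed.

```python
import json
import urllib.request


def iter_response_text(lines):
    """Yield text fragments from newline-delimited JSON chunks."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final chunk marks the end of the generation


def stream_kat_dev(prompt, model="richardyoung/kat-dev-72b:iq3_m",
                   host="http://localhost:11434"):
    """Stream a completion from a local Ollama server (assumed running)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True})
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        # Iterating the response yields one bytes line per JSON chunk.
        yield from iter_response_text(response)
```

A caller could print fragments as they arrive with `for piece in stream_kat_dev("Explain quicksort"): print(piece, end="", flush=True)`.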

License

This model inherits the license from the original KAT-Dev-72B-Exp model. Please refer to the original model page for licensing details.

Citation

If you use this model in your research or applications, please cite:

@misc{kat-dev-72b-2025,
  author = {Kuaishou Technology},
  title = {KAT-Dev-72B: A High-Performance Coding Model},
  year = {2025},
  url = {https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp}
}

Acknowledgments

  • Original Model: Kuaishou Technology (Kwaipilot)
  • Quantizations: mradermacher
  • Framework: Ollama, llama.cpp
  • Base Architecture: Qwen2.5-72B by Alibaba Cloud

Support & Issues

For issues or questions:

  • Ollama models: https://ollama.com/richardyoung/kat-dev-72b
  • Original model: https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp


Note: This is an unofficial distribution. The model is quantized from the original KAT-Dev-72B-Exp for easier deployment via Ollama.