Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL

Ollama model based on Unsloth’s UD-Q2_K_XL quantization of Qwen3-235B-A22B-Instruct-2507.

Model Details

  • Base Model: Qwen3-235B-A22B-Instruct-2507
  • Quantization: Unsloth Dynamic 2.0 (UD-Q2_K_XL), a dynamic 2-bit scheme that keeps the most quantization-sensitive layers at higher precision, yielding better accuracy than uniform 2-bit formats
  • Context Length: 262,144 tokens
  • Model Size: ~88 GB
  • Type: Instruct model (non-thinking mode)

Usage

Run the model locally:

ollama run erbano/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL:latest

Use in your applications:

curl http://localhost:11434/api/generate -d '{
  "model": "erbano/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL:latest",
  "prompt": "What is quantum computing?",
  "stream": false
}'

Python example:

import ollama  # pip install ollama; requires a running Ollama server

response = ollama.chat(
    model='erbano/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL:latest',
    messages=[
        {'role': 'user', 'content': 'Explain machine learning in simple terms'}
    ]
)
print(response['message']['content'])
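For long generations you may prefer streaming the response as it is produced. A minimal sketch, assuming the `ollama` Python package and a local server on the default port (the prompt is illustrative):

```python
def collect_stream(chunks):
    """Join the incremental message fragments yielded by a chat stream."""
    return "".join(chunk["message"]["content"] for chunk in chunks)

if __name__ == "__main__":
    # Requires `pip install ollama` and a running Ollama server.
    import ollama

    stream = ollama.chat(
        model="erbano/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL:latest",
        messages=[{"role": "user", "content": "Explain machine learning in simple terms"}],
        stream=True,  # yields partial responses instead of one final object
    )
    for chunk in stream:
        print(chunk["message"]["content"], end="", flush=True)
```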

Configuration

The model ships with the following default sampling parameters (Qwen's recommended settings for the instruct/non-thinking mode):

  • Temperature: 0.7
  • Top P: 0.8
  • Top K: 20
  • Min P: 0.0
  • Context Window: 262,144 tokens
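These defaults can also be overridden per request through the API's `options` field. A sketch of the request payload for Ollama's `/api/chat` endpoint (the prompt is illustrative; field names follow the Ollama API):

```python
import json

# Sampling settings from the list above, expressed as Ollama request options.
# num_ctx is the context window; lower it if the full 262,144 tokens
# exceed your available RAM.
options = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "num_ctx": 262144,
}

payload = {
    "model": "erbano/Qwen3-235B-A22B-Instruct-2507-UD-Q2_K_XL:latest",
    "messages": [{"role": "user", "content": "What is quantum computing?"}],
    "stream": False,
    "options": options,
}

# POST this to http://localhost:11434/api/chat (e.g. with curl or requests).
print(json.dumps(payload, indent=2))
```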

Performance Notes

  • Quantization Quality: This model uses Unsloth Dynamic 2.0 (UD-Q2_K_XL) quantization.
    • Expected Performance: Specific benchmark scores for this quantized version are not published, but Unsloth reports that the Dynamic 2.0 methodology shows minimal accuracy loss relative to the unquantized FP16 model, as measured by 5-shot MMLU and KL divergence.
    • Comparison: It is designed to significantly outperform standard 2-bit quantizations (like Q2_K) which typically suffer from noticeable degradation.
  • Base Model Performance: The unquantized Qwen3-235B-A22B-Instruct-2507 achieves state-of-the-art results; treat these as the ceiling for the quantized model:
    • MMLU-Redux: 93.1
    • MMLU-Pro: 83.0
    • LiveCodeBench: 51.8
    • Arena-Hard v2: 79.2
  • Context Window: Supports up to 262,144 tokens natively.

Instruct vs. Thinking Mode

This model is the Instruct variant, designed for direct interaction. It differs from the “Thinking” (Reasoning) variants of Qwen3 as follows:

Pros of Instruct Mode (This Model)

  • Speed: Significantly faster response times as it skips the hidden “thought generation” process.
  • Efficiency: Consumes fewer tokens per request, making it more cost-effective and faster to run.
  • Predictability: Outputs are direct and follow standard chat templates, with no <think> tags to parse. Thinking models sometimes fail to close their reasoning tags or never reach a final answer.
  • General Purpose: Ideal for RAG, creative writing, roleplay, and general knowledge tasks where deep reasoning is not the primary bottleneck.

Cons vs. Thinking Mode

  • Complex Reasoning: May underperform on extremely difficult math, logic, or coding tasks that benefit from “Chain of Thought” exploration.
  • Self-Correction: Lacks the internal monologue mechanism to catch and correct its own errors before outputting the final answer.

Verdict: Choose this Instruct model for a balance of speed and performance in general applications. Choose a Thinking model only for specialized, high-complexity reasoning tasks.
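If you do switch to a Thinking variant, its output wraps the reasoning in <think>…</think> before the final answer. A small helper to separate the two, written to also tolerate an unclosed tag (which, as noted above, thinking models occasionally produce); this is a sketch, not part of any Qwen or Ollama API:

```python
import re

def strip_think(text):
    """Return (reasoning, answer) from a thinking-model response.

    If the closing </think> tag is missing, everything after <think>
    is treated as reasoning and the answer is empty.
    """
    match = re.search(r"<think>(.*?)(?:</think>|$)(.*)", text, re.DOTALL)
    if not match:
        return "", text.strip()  # no reasoning block at all
    return match.group(1).strip(), match.group(2).strip()
```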

Hardware Requirements

  • RAM/VRAM: The model file is approximately 88 GB.
    • Minimum: ~96 GB system RAM (for CPU-only inference) or VRAM (for GPU inference).
    • Recommended: 128 GB+ RAM/VRAM to accommodate context window overhead.
  • Storage: At least 90 GB of free disk space.

License

Apache 2.0 - See original model license at Qwen3-235B-A22B-Instruct-2507

Citation

If you use this model, please cite the original Qwen3 Technical Report:

@misc{qwen3technicalreport,
      title={Qwen3 Technical Report}, 
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}, 
}
