
Useful Unsloth-DQ2 quants of the smaller qwen3 models

Capabilities: tools, thinking · Sizes: 1.7b, 4b, 8b


Default tag: qwen3 · 4.02B · Q4_0 · 2.4GB · 99291f3b3db7
Default params: min_p 0 · repeat_penalty 1 · stop <|im_start|>, <|im_end|>


Notes

I’ve uploaded additional quantizations of the qwen3 models, with two distinct variations:

  1. Thinking versions: Retain the original Qwen3’s hybrid step-by-step reasoning.

  2. Non-Thinking versions: Provide faster responses without step-by-step reasoning. The 2507 builds are this variant.

The 1.7b model includes some standard quants as well as some of Unsloth’s DynamicQuant2.0 versions, which offer superior accuracy while maintaining efficiency. All of the 4b and 8b models are Unsloth DQ2.

To switch between the Thinking versions’ step-by-step mode and instant-response mode, add /think or /no_think to the system/user prompts.
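
For example, a minimal sketch with the ollama Python client (pip install ollama). The tag name here is an assumption; substitute whichever Thinking quant you actually pulled:

    import ollama

    # Appending /no_think to the user prompt skips the step-by-step reasoning.
    resp = ollama.chat(
        model="qwen3:1.7b-q4_k_xl",  # hypothetical tag
        messages=[{"role": "user", "content": "What is 12 * 34? /no_think"}],
    )
    print(resp["message"]["content"])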

For 1.7b (default is Q4_K_XL):

  • Q3_K_XL runs quite fast and seems usable
  • Q4_K_XL is a good sweet spot, performs much better than the official Q4_K_M
  • Q5_K_M offers a good balance of quality & efficiency
  • Q5_K_XL performs at full quality
  • Q6_K performs at full quality

(The Q5_K_M and Q6_K models were quantized from the fp16 weights using Ollama. To take advantage of Unsloth’s DynamicQuant2.0, use the K_XL quants.)
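
To grab a specific quant rather than the default tag, pull it explicitly. A sketch with the ollama Python client; the tag names below are assumptions based on the list above, so check the model page’s tags for the exact spelling:

    import ollama

    ollama.pull("qwen3:1.7b-q4_k_xl")  # Unsloth DQ2 sweet spot (hypothetical tag)
    ollama.pull("qwen3:1.7b-q6_k")     # full-quality standard quant (hypothetical tag)

    resp = ollama.chat(
        model="qwen3:1.7b-q4_k_xl",
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(resp["message"]["content"])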

For 4b (default is Q3_K_XL):

  • Q3_K_XL runs quite fast and should perform similarly to the official Q4_K_M release
  • Q4_K_XL is a good sweet spot, performs much better than the official Q4_K_M
  • Q5_K_XL performs at full quality

For 4b-2507 (Non-Thinking version):

  • Q2_K_XL
  • Q3_K_XL
  • Q4_0
  • Q4_K_XL
  • Q5_K_XL
  • Q8_0

For 8b (default is the non-thinking 2507 version at Q4_0):

  • Q3_K_XL runs quite fast and should perform similarly to the official Q4_K_M release
  • Q4_K_XL is a good sweet spot, performs much better than the official Q4_K_M
  • Q5_K_XL performs at full quality

Description

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. The models feature a unique hybrid approach with two modes:

  • Thinking Mode: Takes time to reason step by step before delivering the final answer, ideal for complex problems requiring deeper thought.

  • Non-Thinking Mode: Provides quick, near-instant responses, suitable for simpler questions where speed is more important than depth.

Qwen3 models support 119 languages and dialects, making them truly multilingual. They excel at coding, math, reasoning, and agentic capabilities, with significantly improved performance over previous generations.

Key Features

  • Hybrid Thinking Modes: Switch between detailed reasoning and quick responses

  • Multilingual Support: 119 languages and dialects

  • Improved Agentic Capabilities: Enhanced tool use and environmental interaction (see the sketch after this list)

  • Context Length: 32K-128K tokens depending on model size

  • Open Weights: Available under Apache 2.0 license
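
Since the models are tagged with tool support, here is a hedged sketch of tool calling through the ollama Python client. The get_weather tool and the tag name are made up for illustration, and the exact response shape may vary by client version:

    import ollama

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = ollama.chat(
        model="qwen3:8b",  # hypothetical tag
        messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
        tools=tools,
    )
    # The model either answers directly or emits tool_calls for you to execute.
    print(resp["message"])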

You can dynamically switch between thinking and non-thinking modes by adding /think or /no_think to either of the following:

  • The beginning or end of the system prompt

  • The beginning or end of the user prompt

(I’m not sure if it’ll work if you put it in the middle of the prompts.)
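
A sketch of both placements, again with the ollama Python client and a hypothetical tag; the trailing directive sets the default mode, and a later user turn can override it:

    import ollama

    MODEL = "qwen3:4b-q4_k_xl"  # hypothetical tag

    messages = [
        # /think at the end of the system prompt enables reasoning by default.
        {"role": "system", "content": "You are a concise assistant. /think"},
        {"role": "user", "content": "Why is the sky blue?"},
    ]
    resp = ollama.chat(model=MODEL, messages=messages)

    # /no_think on a later user turn switches that turn to an instant response.
    messages += [
        resp["message"],
        {"role": "user", "content": "One-sentence summary, please. /no_think"},
    ]
    print(ollama.chat(model=MODEL, messages=messages)["message"]["content"])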

References

  • Qwen3

  • Unsloth