Qwen 3.6 35B-A3B (MoE, 35B total / 3B active, 256 experts), vision + thinking + native tool calling, 262K native context

Details

Updated 3 days ago

3 days ago

d6f2dae7ffb8 · 24GB ·

model

archqwen35moe

parameters36B

quantizationQ4_K_M

24GB

license

Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR US

11kB

params

{ "min_p": 0, "num_ctx": 190000, "num_gpu": 999, "presence_penalty": 1.5, "repea

125B

Qwen 3.6 35B-A3B

Qwen 3.6 35B-A3B (MoE, 35B total / 3B active, 256 experts), vision + thinking + native tool calling, 262K native context.

Model card for odytrice/qwen3.6-35b:5090. The 5090 is the only profile in this set - at ~23 GB of Q4 weights the model does not fit comfortably on a 24 GB 4090.

Upstream

Field	Value
Upstream	`Qwen/Qwen3.6-35B-A3B`
NVFP4 sources	`unsloth/Qwen3.6-35B-A3B-NVFP4`, `RedHatAI/Qwen3.6-35B-A3B-NVFP4`
Family	Qwen 3.6 (Alibaba)
Architecture	Mixture-of-Experts (A3B)
Total / Active params	35B / 3B
Experts	256 (8 routed + 1 shared)
Layers	40 (hybrid: Gated DeltaNet + Gated Attention + MoE)
Modalities	Text + Image + Video (vision)
Languages	100+
Tool calling	Native (`qwen3_coder` parser)
Thinking mode	Default on; preserves thinking traces across turns
Native context	262,144 (extensible to 1,010,000 via YaRN)
License	Apache 2.0

Tag	GPU	Quantization	KV cache	`num_ctx`
`odytrice/qwen3.6-35b:5090`	RTX 5090 (32 GB Blackwell)	Q4_K_M (~23 GB), NVFP4 future	q8_0	190000

Architecture note

The A3B suffix in Qwen3.6-35B-A3B means 3B activated parameters per token out of 35B total. Per the Qwen team’s HF card: 256 experts (8 routed + 1 shared), 40 layers with a hybrid Gated DeltaNet + Gated Attention + MoE layout. Considerably faster per token than the dense 31B-class models in this set.

Sampling

Per the Qwen team’s published guidance:

# Thinking mode - general tasks (default)
temperature        1.0
top_p              0.95
top_k              20
min_p              0.0
presence_penalty   1.5
repetition_penalty 1.0

# Thinking mode - precise coding (e.g. WebDev)
temperature        0.6
top_p              0.95
top_k              20
presence_penalty   0.0

# Instruct (non-thinking) mode
temperature        0.7
top_p              0.80
top_k              20
presence_penalty   1.5

Output length: 32,768 tokens default; 81,920 for hard math/code. To preserve thinking across turns: chat_template_kwargs={"preserve_thinking": True}.

Strengths

MoE with only 3B active params - dramatically faster than dense 32B class
Strong agentic coding (SWE-bench Verified 73.4, SWE-bench Pro 49.5, Terminal-Bench 2.0 51.5)
Native vision: text + image + video input
preserve_thinking for agent scenarios - retains reasoning across turns
100+ languages
Apache 2.0 licensed
NVFP4 weights already published by both unsloth and Red Hat

Caveats

Tightest fit in the 5090 set - verify ollama ps before long runs
Does not fit on a 24 GB 4090 with any usable context
NVFP4 weights exist upstream but Ollama does not yet load them

Qwen 3.6 35B-A3B (MoE, 35B total / 3B active, 256 experts), vision + thinking + native tool calling, 262K native context

Details

Readme

Qwen 3.6 35B-A3B

Upstream

Tags

Why 190K (and why no 4090 tag)

Architecture note

Sampling

Strengths

Caveats

See also