113 Downloads Updated 3 days ago
ollama run odytrice/qwen3.6-35b:5090
ollama launch claude --model odytrice/qwen3.6-35b:5090
ollama launch codex-app --model odytrice/qwen3.6-35b:5090
ollama launch openclaw --model odytrice/qwen3.6-35b:5090
ollama launch hermes --model odytrice/qwen3.6-35b:5090
ollama launch codex --model odytrice/qwen3.6-35b:5090
ollama launch opencode --model odytrice/qwen3.6-35b:5090
Qwen 3.6 35B-A3B (MoE, 35B total / 3B active, 256 experts), vision + thinking + native tool calling, 262K native context.
Model card for odytrice/qwen3.6-35b:5090. The 5090 is the only profile
in this set - at ~23 GB of Q4 weights the model does not fit comfortably
on a 24 GB 4090.
| Field | Value |
|---|---|
| Upstream | Qwen/Qwen3.6-35B-A3B |
| NVFP4 sources | unsloth/Qwen3.6-35B-A3B-NVFP4, RedHatAI/Qwen3.6-35B-A3B-NVFP4 |
| Family | Qwen 3.6 (Alibaba) |
| Architecture | Mixture-of-Experts (A3B) |
| Total / Active params | 35B / 3B |
| Experts | 256 (8 routed + 1 shared) |
| Layers | 40 (hybrid: Gated DeltaNet + Gated Attention + MoE) |
| Modalities | Text + Image + Video (vision) |
| Languages | 100+ |
| Tool calling | Native (qwen3_coder parser) |
| Thinking mode | Default on; preserves thinking traces across turns |
| Native context | 262,144 (extensible to 1,010,000 via YaRN) |
| License | Apache 2.0 |
| Tag | GPU | Quantization | KV cache | num_ctx |
|---|---|---|---|---|
odytrice/qwen3.6-35b:5090 |
RTX 5090 (32 GB Blackwell) | Q4_K_M (~23 GB), NVFP4 future | q8_0 | 190000 |
ollama ps; if CPU% appears, drop
to 131072 or 153600 or switch KV cache to q4_0.The A3B suffix in Qwen3.6-35B-A3B means 3B activated parameters per
token out of 35B total. Per the Qwen team’s HF card: 256 experts
(8 routed + 1 shared), 40 layers with a hybrid Gated DeltaNet + Gated
Attention + MoE layout. Considerably faster per token than the dense
31B-class models in this set.
Per the Qwen team’s published guidance:
# Thinking mode - general tasks (default)
temperature 1.0
top_p 0.95
top_k 20
min_p 0.0
presence_penalty 1.5
repetition_penalty 1.0
# Thinking mode - precise coding (e.g. WebDev)
temperature 0.6
top_p 0.95
top_k 20
presence_penalty 0.0
# Instruct (non-thinking) mode
temperature 0.7
top_p 0.80
top_k 20
presence_penalty 1.5
Output length: 32,768 tokens default; 81,920 for hard math/code.
To preserve thinking across turns: chat_template_kwargs={"preserve_thinking": True}.
preserve_thinking for agent scenarios - retains reasoning across turnsollama ps before long runs