
Qwen 3.6 35B-A3B — Quantized by BatiAI

“Agentic Coding Power, Now Open to All.” Imatrix-calibrated quantizations of the official Qwen 3.6 35B-A3B MoE, released 2026-04-15 by Alibaba. Text-only, built directly from Alibaba’s BF16 weights.

🎬 Demo (55s) — Q&A + Tools + Calendar

Real on-device inference on an M4 Max, in three scenarios:

  1. Q&A streaming — “5 tips for writing professional emails” at ~46 t/s
  2. Code + file tools — Python regex function → save to file → reveal in Finder
  3. Calendar — “Show me today’s schedule” → live Mac Calendar query and event add

All 100% local through BatiFlow — one click, no code, no API keys, no subscription. Built so non-developers can use this kind of AI automation on their Mac.

Models

Tag          Size    Min RAM   Use Case
:iq3 / :q3   13 GB   16 GB     16 GB Mac mini / MacBook Air
:iq4 / :q4   18 GB   24 GB     MacBook Pro / Mac Studio (recommended)
:q6          27 GB   36 GB     MacBook Pro M4 Pro / Studio — max on-device quality

IQ3/IQ4 are imatrix (wikitext-calibrated) for better quality per bit at low bit-widths. Q6_K is a high-bit K-quant — near-BF16 quality for users with enough RAM.

Tool calling: all tags support tools + thinking, but IQ3 can emit malformed tool-call JSON (see Measured Performance below), so prefer IQ4 or Q6 for tool use. Pass "think": false in chat requests for fast tool-call responses, as in the sketch below.
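
A minimal sketch of such a request against Ollama’s /api/chat endpoint (the get_weather tool is a hypothetical placeholder; swap in your own schema):

# Chat request with thinking disabled and one tool attached
curl http://localhost:11434/api/chat -d '{
  "model": "batiai/qwen3.6-35b:iq4",
  "think": false,
  "stream": false,
  "messages": [
    { "role": "user", "content": "What is the weather in Seoul right now?" }
  ],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }]
}'

If the model opts to call the tool, the reply’s message.tool_calls field carries the function name and parsed arguments.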

Quick Start

ollama pull batiai/qwen3.6-35b:iq4
ollama run batiai/qwen3.6-35b:iq4

Why Qwen 3.6 35B-A3B?

Upstream positions 3.6 as a major agentic-coding upgrade over 3.5. Key numbers from Alibaba’s official BF16 benchmarks:

vs Qwen 3.5 35B-A3B — clear generation-on-generation jump

Benchmark 3.5-35B-A3B 3.6-35B-A3B Δ
SWE-bench Verified 70.0 73.4 +3.4
Terminal-Bench 2.0 40.5 51.5 +11.0
QwenWebBench 978 1397 +43%

vs Gemma 4 31B — beats it on every published coding & math test

Benchmark Gemma 4-31B Qwen 3.6-35B-A3B
SWE-bench Verified 52.0 73.4
SWE-bench Multilingual 51.7 67.2
SWE-bench Pro 35.7 49.5
Terminal-Bench 2.0 42.9 51.5
AIME26 89.2 92.7

Despite Gemma 4-31B being a similarly sized dense model, 3.6’s MoE architecture activates only 3B of its 35B parameters per token, so it outpaces the dense model on agentic coding and math while using roughly 9× less compute per token (3B active vs ~27B dense per forward pass).

Headline capabilities

  • SWE-bench Verified 73.4 — top-tier agentic coding among open models
  • AIME26 92.7 · GPQA 86.0 · HMMT Feb 26 83.6 — frontier math
  • MMLU-Pro 85.2 · MMLU-Redux 93.3 — strong general knowledge
  • Repo-level reasoning + “thinking preservation” for iterative dev
  • 262K native context (1M with YaRN; see the Modelfile sketch after this list to raise num_ctx)
  • Function calling via qwen3_coder parser — works with Tools in BatiFlow
  • Apache 2.0 — commercial-friendly
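
If you need the larger context window, a minimal Modelfile sketch (the qwen36-262k name is arbitrary; note that a 262K KV cache adds significant RAM on top of the figures above):

# Derive a model with a 262,144-token context window
cat > Modelfile <<'EOF'
FROM batiai/qwen3.6-35b:iq4
PARAMETER num_ctx 262144
EOF
ollama create qwen36-262k -f Modelfile
ollama run qwen36-262k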

These are upstream BF16 figures. Quantization (IQ3/IQ4) may cost a few points on the hardest benchmarks — run with --verbose locally to see real tokens/s on your Mac.

MoE Advantage

                     Qwen3.6-35B-A3B (MoE)        Typical 27B Dense
Total params         35B                          27B
Active params        3B                           27B
Experts              256 (8 routed + 1 shared)    —
Typical VRAM (IQ4)   ~23 GB                       ~28 GB

RAM Guide

Your Mac RAM   IQ3 (13 GB)   IQ4 (18 GB)
16 GB          ✅ tight       ❌
24 GB          ✅             ✅ tight
32 GB          ✅             ✅
48 GB+         ✅             ✅ ideal
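
Unsure how much unified memory your Mac has? A quick check (macOS only; sysctl reports bytes, so the arithmetic converts to GB):

# Print installed unified memory in GB
echo "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )) GB"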

Measured Performance

MacBook Pro M4 Max (128 GB) — 100 % GPU

Metric             IQ3_XXS     IQ4_XS
Gen speed (warm)   45.9 t/s    46.5 t/s
Prompt eval        104.9 t/s   105.0 t/s
Load time          3.0 s       5.3 s
Ollama RAM         18 GB       23 GB
Tool call JSON     ❌ fail      ✅ pass

Mac mini M4 (16 GB) — IQ3 runs at ~2–3 t/s (swap pressure, single-turn only). IQ4 does not fit.

Reference: 2× RTX 6000 Ada (96 GB VRAM, Linux) — not our target hardware, but useful as a ceiling:

Metric             IQ3         IQ4         Q6
Gen speed (warm)   133.0 t/s   115.4 t/s   112.3 t/s
Prompt eval        722 t/s     666 t/s     516 t/s
VRAM               18 GB       23 GB       33 GB

The M4 Max reaches ~35–40% of the server’s warm throughput at a fraction of the power draw — inference here is memory-bandwidth bound, not compute bound.

Key takeaways

  • IQ3 ≈ IQ4 in speed on M4 Max (~1 % apart) — memory-bandwidth bound.
  • ~1.75× faster than Qwen 3.5-35B-A3B IQ4 on the same M4 Max (46.5 vs 26.6 t/s).
  • IQ3 can fail function-call JSON — if you use tool calling, pick IQ4 or Q6.
  • Q6 is the quality ceiling on Mac — 36 GB+ unified memory recommended.

Benchmark It Yourself

ollama run batiai/qwen3.6-35b:iq4 --verbose "Write a haiku about Seoul in autumn."

--verbose prints the prompt-eval rate and token-generation rate after each response.
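
To compare quants on your own hardware, a simple loop over whichever tags you have pulled (any prompt works):

# Run the same prompt through each tag and compare the timing stats
for tag in iq3 iq4 q6; do
  echo "== $tag =="
  ollama run batiai/qwen3.6-35b:$tag --verbose "Write a haiku about Seoul in autumn."
done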

Why BatiAI?

  • Quantized directly from official Qwen BF16 weights — no re-quantization of someone else’s GGUF
  • IQ3_XXS + IQ4_XS with imatrix (wikitext-2-raw calibration)
  • Same pipeline as every BatiAI model — verified on real Apple Silicon
  • Built for BatiFlow — 57 tool functions, tool calling validated

Why text-only?

Upstream Qwen 3.6 is multimodal — it has a vision tower (~1–2 GB extra) that handles images. Multimodal GGUFs need two files (main + mmproj.gguf), and Ollama’s mmproj integration is rough today.

We deliberately ship the text tower only:

  • ✅ Single file — one ollama pull works out of the box
  • ✅ Smaller disk / RAM footprint
  • ✅ Covers everything BatiFlow needs — chat, code, tool calls, RAG

Need images (OCR, captioning, visual reasoning)? Use upstream weights directly. Need text+image embedding for RAG? See batiai/qwen3-vl-embed-2b.

About the “3.6” naming

Qwen released this publicly as 3.6, but the config still uses the architecture name Qwen3_5MoeForConditionalGeneration internally (transitional class name from the 3.5 line). llama.cpp converts via Qwen3_5MoeTextModel — text tower only.

Built for BatiFlow

flow.bati.ai — free, on-device AI automation for Mac. 5MB app, 100% local, unlimited.