
Qwen 3.6 35B-A3B — Quantized by BatiAI

“Agentic Coding Power, Now Open to All.” Imatrix-calibrated quantizations of the official Qwen 3.6 35B-A3B MoE, released 2026-04-15 by Alibaba. Text-only, built directly from Alibaba’s BF16 weights.

🎬 Demo (55s) — Q&A + Tools + Calendar

Real on-device inference on an M4 Max, in three scenarios:

  1. Q&A streaming — “5 tips for writing professional emails” at ~46 t/s
  2. Code + file tools — Python regex function → save to file → reveal in Finder
  3. Calendar — “Show me today’s schedule” → live Mac Calendar query and event add

All 100% local through BatiFlow — one click, no code, no API keys, no subscription. Built so non-developers can use this kind of AI automation on their Mac.

Models

Tag          Size    Min RAM   Use Case
:iq3 / :q3   13 GB   16 GB     16 GB Mac mini / MacBook Air
:iq4 / :q4   18 GB   24 GB     MacBook Pro / Mac Studio (recommended)
:q6          27 GB   36 GB     MacBook Pro M4 Pro / Studio — max on-device quality

IQ3/IQ4 are imatrix (wikitext-calibrated) for better quality per bit at low bit-widths. Q6_K is a high-bit K-quant — near-BF16 quality for users with enough RAM.

Tool calling: all tags support tools + thinking, but IQ3 can emit malformed tool-call JSON (see Measured Performance below), so prefer IQ4 or Q6 for tool use. Pass "think": false in chat requests for fast tool-call responses, as in the sketch below.
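
A minimal sketch of such a request against Ollama’s /api/chat endpoint (the get_weather tool is a hypothetical placeholder; swap in your own schema):

# Chat request with thinking disabled and one tool attached
curl http://localhost:11434/api/chat -d '{
  "model": "batiai/qwen3.6-35b:iq4",
  "think": false,
  "stream": false,
  "messages": [
    { "role": "user", "content": "What is the weather in Seoul right now?" }
  ],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }]
}'

If the model opts to call the tool, the reply’s message.tool_calls field carries the function name and parsed arguments.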

Quick Start

ollama pull batiai/qwen3.6-35b:iq4
ollama run batiai/qwen3.6-35b:iq4

Why Qwen 3.6 35B-A3B?

Upstream positions 3.6 as a major agentic-coding upgrade over 3.5. Key numbers from Alibaba’s official BF16 benchmarks:

vs Qwen 3.5 35B-A3B — clear generation-on-generation jump

Benchmark 3.5-35B-A3B 3.6-35B-A3B Δ
SWE-bench Verified 70.0 73.4 +3.4
Terminal-Bench 2.0 40.5 51.5 +11.0
QwenWebBench 978 1397 +43%

vs Gemma 4 31B — beats it on every published coding & math test

Benchmark Gemma 4-31B Qwen 3.6-35B-A3B
SWE-bench Verified 52.0 73.4
SWE-bench Multilingual 51.7 67.2
SWE-bench Pro 35.7 49.5
Terminal-Bench 2.0 42.9 51.5
AIME26 89.2 92.7

Despite Gemma 4-31B being a similarly sized dense model, 3.6’s MoE architecture activates only 3B of its 35B parameters per token, so it outpaces the dense model on agentic coding and math while using roughly 9× less compute per token (3B active vs ~27B dense per forward pass).

Headline capabilities

  • SWE-bench Verified 73.4 — top-tier agentic coding among open models
  • AIME26 92.7 · GPQA 86.0 · HMMT Feb 26 83.6 — frontier math
  • MMLU-Pro 85.2 · MMLU-Redux 93.3 — strong general knowledge
  • Repo-level reasoning + “thinking preservation” for iterative dev
  • 262K native context (1M with YaRN; see the Modelfile sketch after this list to raise num_ctx)
  • Function calling via qwen3_coder parser — works with Tools in BatiFlow
  • Apache 2.0 — commercial-friendly
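
If you need the larger context window, a minimal Modelfile sketch (the qwen36-262k name is arbitrary; note that a 262K KV cache adds significant RAM on top of the figures above):

# Derive a model with a 262,144-token context window
cat > Modelfile <<'EOF'
FROM batiai/qwen3.6-35b:iq4
PARAMETER num_ctx 262144
EOF
ollama create qwen36-262k -f Modelfile
ollama run qwen36-262k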

These are upstream BF16 figures. Quantization (IQ3/IQ4) may cost a few points on the hardest benchmarks — run with --verbose locally to see real tokens/s on your Mac.

MoE Advantage

                     Qwen3.6-35B-A3B (MoE)        Typical 27B Dense
Total params         35B                          27B
Active params        3B                           27B
Experts              256 (8 routed + 1 shared)    —
Typical VRAM (IQ4)   ~23 GB                       ~28 GB

RAM Guide

Your Mac RAM   IQ3 (13 GB)   IQ4 (18 GB)
16 GB          ✅ tight       ❌
24 GB          ✅             ✅ tight
32 GB          ✅             ✅
48 GB+         ✅             ✅ ideal
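
Unsure how much unified memory your Mac has? A quick check (macOS only; sysctl reports bytes, so the arithmetic converts to GB):

# Print installed unified memory in GB
echo "$(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 )) GB"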

Measured Performance

MacBook Pro M4 Max (128 GB) — 100 % GPU

Metric             IQ3_XXS     IQ4_XS
Gen speed (warm)   45.9 t/s    46.5 t/s
Prompt eval        104.9 t/s   105.0 t/s
Load time          3.0 s       5.3 s
Ollama RAM         18 GB       23 GB
Tool call JSON     ❌ fail      ✅ pass

Mac mini M4 (16 GB) — IQ3 runs at ~2–3 t/s (swap pressure, single-turn only). IQ4 does not fit.

Reference: 2× RTX 6000 Ada (96 GB VRAM, Linux) — not our target hardware, but useful as a ceiling:

Metric             IQ3         IQ4         Q6
Gen speed (warm)   133.0 t/s   115.4 t/s   112.3 t/s
Prompt eval        722 t/s     666 t/s     516 t/s
VRAM               18 GB       23 GB       33 GB

The M4 Max reaches ~35–40% of the server’s warm throughput at a fraction of the power draw — inference here is memory-bandwidth bound, not compute bound.

Key takeaways

  • IQ3 ≈ IQ4 in speed on M4 Max (~1 % apart) — memory-bandwidth bound.
  • ~1.75× faster than Qwen 3.5-35B-A3B IQ4 on the same M4 Max (46.5 vs 26.6 t/s).
  • IQ3 can fail function-call JSON — if you use tool calling, pick IQ4 or Q6.
  • Q6 is the quality ceiling on Mac — 36 GB+ unified memory recommended.

Benchmark It Yourself

ollama run batiai/qwen3.6-35b:iq4 --verbose "Write a haiku about Seoul in autumn."

--verbose prints the prompt-eval rate and token-generation rate after each response.
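
To compare quants on your own hardware, a simple loop over whichever tags you have pulled (any prompt works):

# Run the same prompt through each tag and compare the timing stats
for tag in iq3 iq4 q6; do
  echo "== $tag =="
  ollama run batiai/qwen3.6-35b:$tag --verbose "Write a haiku about Seoul in autumn."
done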

Why BatiAI?

  • Quantized directly from official Qwen BF16 weights — no re-quantization of someone else’s GGUF
  • IQ3_XXS + IQ4_XS with imatrix (wikitext-2-raw calibration)
  • Same pipeline as every BatiAI model — verified on real Apple Silicon
  • Built for BatiFlow — 57 tool functions, tool calling validated

Why text-only?

Upstream Qwen 3.6 is multimodal — it has a vision tower (~1–2 GB extra) that handles images. Multimodal GGUFs need two files (main + mmproj.gguf), and Ollama’s mmproj integration is rough today.

We deliberately ship the text tower only:

  • ✅ Single file — one ollama pull works out of the box
  • ✅ Smaller disk / RAM footprint
  • ✅ Covers everything BatiFlow needs — chat, code, tool calls, RAG

Need images (OCR, captioning, visual reasoning)? Use upstream weights directly. Need text+image embedding for RAG? See batiai/qwen3-vl-embed-2b.

About the “3.6” naming

Qwen released this publicly as 3.6, but the config still uses the architecture name Qwen3_5MoeForConditionalGeneration internally (transitional class name from the 3.5 line). llama.cpp converts via Qwen3_5MoeTextModel — text tower only.

Built for BatiFlow

flow.bati.ai — free, on-device AI automation for Mac. 5MB app, 100% local, unlimited.