ollama run batiai/qwen3.6-27b:q6
“Flagship Coding in a 27B Dense Package.” imatrix-calibrated GGUF quantizations of the official Qwen/Qwen3.6-27B (Dense, Apache 2.0), released 2026-04-22 by Alibaba. Free, unlimited, on-device AI for Mac via BatiFlow.
Multimodal-capable (vision) via separate mmproj on Hugging Face. Ollama ships text-only.
| Tag | Size | Min RAM | Use case |
|---|---|---|---|
| :iq3 | 11 GB | 24 GB | Smallest footprint |
| :q3 | 13 GB | 24 GB | K-quant alternative to :iq3 |
| :iq4 | 15 GB | 24 GB | ⚠ currently slow on Apple Metal (see note) |
| :q4 | 16 GB | 24 GB | Recommended on Mac (best speed/quality) |
| :q6 | 21 GB | 32 GB+ | Near-BF16 quality |
All 5 are imatrix-calibrated (wikitext-2-raw). Every tag supports tools + thinking; Qwen 3.6 thinks by default — pass "think": false in /api/chat to skip the <think> block for low-latency tool calls. The legacy Qwen 3.5 /no_think prompt prefix does NOT work on 3.6.
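For low-latency tool calls, a minimal /api/chat request with thinking disabled looks roughly like this (model tag and prompt are illustrative):

curl http://localhost:11434/api/chat -d '{
  "model": "batiai/qwen3.6-27b:q4",
  "think": false,
  "stream": false,
  "messages": [{ "role": "user", "content": "Write a one-line commit message for a typo fix." }]
}'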
ollama pull batiai/qwen3.6-27b:q4
ollama run batiai/qwen3.6-27b:q4
| | Qwen 3.6 27B (this model) | Qwen 3.6 35B-A3B |
|---|---|---|
| Architecture | Dense 27B | MoE, 3B active / 35B total |
| Typical M4 Max gen speed | 16–18 t/s | ~45–50 t/s |
| Strength | single-pass dense-reasoning quality, long-horizon agents | interactive chat, streaming, lower RAM |
| Best for | batch tool-use, code-review loops, offline generation | default BatiFlow chat, live RAG |
Both Apache 2.0, both with tools + thinking + 262 K context. Pull 27B when per-token latency matters less than maximum dense-model quality.
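Ollama's default context window is far shorter than the model's 262 K maximum; for long-horizon agent runs, raise it per session in the REPL (32768 is an illustrative value, and larger windows need correspondingly more RAM):

ollama run batiai/qwen3.6-27b:q4
>>> /set parameter num_ctx 32768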
Upstream positions it as “flagship coding in a dense package”. Alibaba reports the 27B dense matches or beats the previous-generation 397B-A17B MoE on major agentic-coding benchmarks — 14× smaller total footprint for equivalent reasoning on long-horizon coding tasks.
Tool calling uses the qwen3_coder parser and works with BatiFlow Tools.

| Your Mac | :iq3 11 GB | :q3 13 GB | :iq4 15 GB | :q4 16 GB | :q6 21 GB |
|---|---|---|---|---|---|
| 16 GB | ❌ swap-bound (0.02 t/s measured) | ❌ | ❌ | ❌ | ❌ |
| 24 GB | ✅ | ✅ | ✅ (slow; see Metal note) | ✅ | ❌ |
| 32 GB | ✅ | ✅ | ✅ | ✅ | ✅ tight |
| 48 GB+ | ✅ | ✅ | ✅ | ✅ | ✅ comfortable |
16 GB Mac: this model is not for you. Dense 27B + KV cache + macOS exceeds 16 GB unified memory; measured at 0.02 t/s (~30 min for a short greeting). Use smaller BatiAI models on 16 GB Macs — Qwen 3.5 9B, Gemma 4 E4B-it, etc.
Mac benchmarks (ollama run --verbose, thinking on):

| Hardware | Quant | Gen (warm) | Prompt eval | Cold load | Ollama RAM |
|---|---|---|---|---|---|
| M4 Max 128 GB | IQ3_XXS | 17.83 t/s | 108.7 t/s | 5.0 s | 24 GB |
| M4 Max 128 GB | Q3_K_M | 15.30 t/s | 111.7 t/s | 6.6 s | 26 GB |
| M4 Max 128 GB | IQ4_XS ⚠ | 5.52 t/s | 82.5 t/s | 8.0 s | 28 GB |
| M4 Max 128 GB | Q4_K_M | 16.56 t/s | 114.5 t/s | 8.3 s | 29 GB |
| Mac mini M4 16 GB | IQ3_XXS | 0.02 t/s ❌ | 0.6 t/s | 16 s | swap-bound |
All 5 quants produce valid tool-call JSON when "think": false is passed in /api/chat (or when using the updated test-qwen3.6-27b.sh script, which sets it). Real BatiFlow flows always pass think: false for tool calls, so this is the correct usage pattern.
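A minimal tool-call request in that pattern, with get_weather standing in for a real BatiFlow tool schema:

curl http://localhost:11434/api/chat -d '{
  "model": "batiai/qwen3.6-27b:q4",
  "think": false,
  "stream": false,
  "messages": [{ "role": "user", "content": "What is the weather in Seoul?" }],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }]
}'

With "think": false the tool call should come back immediately in message.tool_calls rather than after a <think> block.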
IQ4_XS at 5.52 t/s vs Q4_K_M at 16.56 t/s on the same M4 Max is a known upstream llama.cpp Metal kernel regression, documented in llama.cpp issue #21655 (~3.8× slowdown between tag b8680 and current). The same quant runs at expected speed on older builds and on NVIDIA (within 10% of Q4_K_M). When the fix lands upstream, existing :iq4 pulls will speed up without re-pulling.
Until then on Apple Silicon: pull :q4 (Q4_K_M), not :iq4.
| Your Mac | Pull |
|---|---|
| 16 GB | ❌ not this model — too small (use qwen3.5-9b or gemma4-e4b) |
| 24 GB | batiai/qwen3.6-27b:iq3 or :q3 |
| 32 GB | batiai/qwen3.6-27b:q4 ← best speed/quality combo |
| 48 GB+ | batiai/qwen3.6-27b:q4 (interactive) or :q6 (max quality) |
NVIDIA server benchmarks, measured with llama-cli --reasoning off, build bafae2765, thinking OFF:
Single GPU (models fit in one 48 GB card — fastest configuration):
| Quant | Gen t/s | Load | VRAM (4 K ctx) |
|---|---|---|---|
| IQ3_XXS | 97.4 | 5 s | ~12 GB |
| Q3_K_M | 88.2 | 8 s | ~15 GB |
| IQ4_XS | 85.7 | 9 s | ~16 GB |
| Q4_K_M | 79.0 | 10 s | ~18 GB |
| Q6_K | 64.1 | 13 s | ~23 GB |
Dual-GPU tensor-split (Q6_K reference): 35.6 t/s — 45 % slower than single-GPU because splitting a 23 GB model that already fits in one 48 GB card adds tensor-parallel communication overhead with zero memory benefit. Tensor-split is for models too large for one card (e.g. Qwen 3.6-35B-A3B long-context or 1 T+ MoE), not for speedup on 27 B. Use CUDA_VISIBLE_DEVICES=1 for inference on this lineup.
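Concretely, a single-GPU launch pinned to the second card might look like this (the GGUF filename is a placeholder):

# Q6_K (~23 GB) fits in one 48 GB card, so skip tensor-split entirely
CUDA_VISIBLE_DEVICES=1 llama-server -m qwen3.6-27b-Q6_K.gguf -ngl 99 -c 4096 --port 8080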
Mac reaches ~20 % of single-GPU server throughput — expected for memory-bandwidth-bound dense 27 B.
ollama run batiai/qwen3.6-27b:q4 --verbose "Write a haiku about Seoul in autumn."
--verbose prints the prompt-eval rate, token-generation rate, and memory use after the response.
Upstream Qwen 3.6-27B is multimodal. GGUF splits this into two files (main model + mmproj.gguf), and Ollama's mmproj integration is still rough, so on Ollama we ship the text tower only: a single file, one ollama pull, covering every BatiFlow use case (chat, code, tools, RAG).
Need images? Download the mmproj-*-Q6_K.gguf separately from Hugging Face and run via llama-server --mmproj … for OCR, image captioning, and visual reasoning.
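A sketch, assuming the Hugging Face filenames (yours may differ):

llama-server -m Qwen3.6-27B-Q4_K_M.gguf --mmproj mmproj-Qwen3.6-27B-Q6_K.gguf -ngl 99 --port 8080

Images then go in through the server's OpenAI-compatible /v1/chat/completions endpoint.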
Full BatiAI harness (single script, nothing else needed on the Mac):
curl -O https://raw.githubusercontent.com/batiai/batiai-models/main/test-qwen3.6-27b.sh
chmod +x test-qwen3.6-27b.sh
./test-qwen3.6-27b.sh                  # default: iq3 iq4 q3 q4 (4 tags, ~10 min)
./test-qwen3.6-27b.sh iq4              # one tag
./test-qwen3.6-27b.sh iq3 iq4 q4 q6    # pick your own set
Share reports/bench-qwen3.6-27b-*.json and we will add your hardware row to the Hugging Face card.
Qwen released this publicly as 3.6, but the HF config uses the transitional class name Qwen3_5ForConditionalGeneration internally. llama.cpp converts via Qwen3_5TextModel — same code path as the 35B-A3B sibling.
flow.bati.ai — free, on-device AI automation for Mac. 5 MB app, 100 % local, unlimited.