Qwen 3.6 27B Dense — Quantized by BatiAI

“Flagship Coding in a 27B Dense Package.” These are imatrix-calibrated GGUF quantizations of the official Qwen/Qwen3.6-27B (Dense, Apache 2.0), released 2026-04-22 by Alibaba. Free, unlimited, on-device AI for Mac via BatiFlow.

Multimodal-capable (vision) via separate mmproj on Hugging Face. Ollama ships text-only.

Available tags

| Tag  | Size  | Min RAM | Use case |
|------|-------|---------|----------|
| :iq3 | 11 GB | 24 GB   | Smallest footprint |
| :q3  | 13 GB | 24 GB   | K-quant alt for iq3 |
| :iq4 | 15 GB | 24 GB   | ⚠ currently slow on Apple Metal — see note |
| :q4  | 16 GB | 24 GB   | Recommended on Mac (best speed/quality) |
| :q6  | 21 GB | 32 GB+  | Near-BF16 quality |

All five tags are imatrix-calibrated (wikitext-2-raw) and support tools + thinking. Qwen 3.6 thinks by default — pass "think": false in /api/chat to skip the <think> block for low-latency tool calls. The legacy Qwen 3.5 /no_think prompt prefix does NOT work on 3.6.
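
For reference, a minimal request with thinking disabled might look like this (a sketch against a local Ollama on the default port; swap in whichever tag you pulled):

curl http://localhost:11434/api/chat -d '{
  "model": "batiai/qwen3.6-27b:q4",
  "messages": [{ "role": "user", "content": "Ping?" }],
  "think": false,
  "stream": false
}'

With "think": false the response contains no <think> block, which is what you want for low-latency tool calls.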

Quick Start

ollama pull batiai/qwen3.6-27b:iq4
ollama run batiai/qwen3.6-27b:iq4

Dense vs MoE — which to pull?

|              | Qwen 3.6 27B (this model) | Qwen 3.6 35B-A3B |
|--------------|---------------------------|------------------|
| Architecture | Dense 27B | MoE, 3B active / 35B total |
| Typical M4 Max gen | 16-18 t/s | ~45-50 t/s |
| Strength | single-pass dense-reasoning quality, long-horizon agents | interactive chat, streaming, lower RAM |
| Best for | batch tool-use, code-review loops, offline generation | default BatiFlow chat, live RAG |

Both Apache 2.0, both with tools + thinking + 262 K context. Pull 27B when per-token latency matters less than maximum dense-model quality.

Why Qwen 3.6 27B Dense?

Upstream positions it as “flagship coding in a dense package”. Alibaba reports the 27B dense matches or beats the previous-generation 397B-A17B MoE on major agentic-coding benchmarks — 14× smaller total footprint for equivalent reasoning on long-horizon coding tasks.

  • SWE-bench / Terminal-Bench / QwenWebBench — flagship-tier among dense open models (see full numbers on the HF card)
  • 262 K native context (1 M with YaRN) — whole-repo reasoning
  • Thinking mode (default ON) — step-by-step reasoning before answer
  • Function calling via qwen3_coder parser — works with BatiFlow Tools
  • Multimodal via separate mmproj on HF (Ollama text-only)
  • Apache 2.0 — commercial-friendly
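
Note that Ollama serves a smaller default context than the model's maximum, so to use the long window you raise num_ctx per request (or in a Modelfile). A minimal sketch — and keep in mind the full 262144-token KV cache needs far more RAM than the table below assumes:

curl http://localhost:11434/api/generate -d '{
  "model": "batiai/qwen3.6-27b:q4",
  "prompt": "Summarize the architecture of this repository: ...",
  "options": { "num_ctx": 262144 },
  "stream": false
}'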

RAM guide

| Your Mac | :iq3 11G | :q3 13G | :iq4 15G | :q4 16G | :q6 21G |
|----------|----------|---------|----------|---------|---------|
| 16 GB    | ❌ swap-bound (0.02 t/s measured) | ❌ | ❌ | ❌ | ❌ |
| 24 GB    | ✅ | ✅ | ✅ (slow — see Metal note) | ✅ | ❌ |
| 32 GB    | ✅ | ✅ | ✅ | ✅ | ✅ tight |
| 48 GB+   | ✅ | ✅ | ✅ | ✅ | ✅ comfortable |

16 GB Mac: this model is not for you. Dense 27B + KV cache + macOS exceeds 16 GB unified memory; measured at 0.02 t/s (~30 min for a short greeting). Use smaller BatiAI models on 16 GB Macs — Qwen 3.5 9B, Gemma 4 E4B-it, etc.

Measured performance

Apple Silicon (measured via ollama run --verbose, thinking on)

| Hardware | Quant | Gen (warm) | Prompt eval | Cold load | Ollama RAM |
|----------|-------|------------|-------------|-----------|------------|
| M4 Max 128 GB | IQ3_XXS | 17.83 t/s | 108.7 t/s | 5.0 s | 24 GB |
| M4 Max 128 GB | Q3_K_M | 15.30 t/s | 111.7 t/s | 6.6 s | 26 GB |
| M4 Max 128 GB | IQ4_XS | ⚠ 5.52 t/s | 82.5 t/s | 8.0 s | 28 GB |
| M4 Max 128 GB | Q4_K_M | 16.56 t/s | 114.5 t/s | 8.3 s | 29 GB |
| Mac mini M4 16 GB | IQ3_XXS | 0.02 t/s ❌ | 0.6 t/s | 16 s | swap-bound |

All 5 quants produce valid tool-call JSON when "think": false is passed in /api/chat (or when using the updated test-qwen3.6-27b.sh script, which sets it). Real BatiFlow flows always pass think: false for tool calls, so this is the correct usage pattern.
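
For reference, a tool-call request combining tools with "think": false might look like the sketch below (the get_weather schema is a made-up illustration, not one of BatiFlow's actual tool functions):

curl http://localhost:11434/api/chat -d '{
  "model": "batiai/qwen3.6-27b:q4",
  "messages": [{ "role": "user", "content": "Weather in Seoul?" }],
  "think": false,
  "stream": false,
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }]
}'

If the call succeeds, the reply carries structured arguments in message.tool_calls rather than prose JSON.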

⚠ IQ4_XS is currently slow on Apple M-series — upstream regression

IQ4_XS at 5.52 t/s vs Q4_K_M at 16.56 t/s on the same M4 Max is a known upstream llama.cpp / Metal kernel regression, documented in llama.cpp issue #21655 (~3.8× slowdown from tag b8680 to current). The same quant runs at expected speed on older builds and on NVIDIA GPUs (within 10 % of Q4_K_M). When the fix lands upstream and ships in an Ollama update, the existing :iq4 download will speed up without re-pulling.

Until then on Apple Silicon: pull :q4 (Q4_K_M), not :iq4.

Recommended tag by Mac size

| Your Mac | Pull |
|----------|------|
| 16 GB | ❌ not this model — too small (use qwen3.5-9b or gemma4-e4b) |
| 24 GB | batiai/qwen3.6-27b:iq3 or :q3 |
| 32 GB | batiai/qwen3.6-27b:q4 ← best speed/quality combo |
| 48 GB+ | batiai/qwen3.6-27b:q4 (interactive) or :q6 (max quality) |
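
After pulling, a quick sanity check with stock Ollama commands (sizes should line up with the tag table above):

ollama pull batiai/qwen3.6-27b:q4
ollama list | grep qwen3.6-27b       # expect a ~16 GB entry for :q4
ollama show batiai/qwen3.6-27b:q4    # quant, context length, template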

Server reference — BatiAI build rig (2× RTX 6000 Ada 48 GB = 96 GB total)

Measured with llama-cli --reasoning off, build bafae2765, thinking OFF:

Single GPU (models fit in one 48 GB card — fastest configuration):

| Quant | Gen t/s | Load | VRAM (4 K ctx) |
|-------|---------|------|----------------|
| IQ3_XXS | 97.4 | 5 s | ~12 GB |
| Q3_K_M | 88.2 | 8 s | ~15 GB |
| IQ4_XS | 85.7 | 9 s | ~16 GB |
| Q4_K_M | 79.0 | 10 s | ~18 GB |
| Q6_K | 64.1 | 13 s | ~23 GB |

Dual-GPU tensor-split (Q6_K reference): 35.6 t/s — 45 % slower than single-GPU because splitting a 23 GB model that already fits in one 48 GB card adds tensor-parallel communication overhead with zero memory benefit. Tensor-split is for models too large for one card (e.g. Qwen 3.6-35B-A3B long-context or 1 T+ MoE), not for speedup on 27 B. Use CUDA_VISIBLE_DEVICES=1 for inference on this lineup.
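
Pinning inference to the second card might look like this (a sketch; the GGUF path is a placeholder, and -ngl 99 simply offloads all layers to the GPU):

CUDA_VISIBLE_DEVICES=1 llama-cli -m qwen3.6-27b-Q6_K.gguf -ngl 99 \
    -p "Explain tensor-parallel overhead in one paragraph."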

Mac reaches ~20 % of single-GPU server throughput — expected for memory-bandwidth-bound dense 27 B.

Why BatiAI?

  • Quantized directly from official Qwen BF16 weights — no re-quantization of someone else’s GGUF
  • IQ + K-quant variants share the same wikitext-2-raw imatrix recipe as every BatiAI model
  • Verified on real Apple Silicon, tool-calling validated for BatiFlow’s 57 tool functions

Why text-only on Ollama?

Upstream Qwen 3.6-27B is multimodal. GGUF splits it into two files (main model + mmproj.gguf), and Ollama’s mmproj integration is still rough. On Ollama we ship the text tower only — a single file, one ollama pull, covering every BatiFlow use case (chat, code, tools, RAG).

Need images? Download the mmproj-*-Q6_K.gguf separately from Hugging Face and run via llama-server --mmproj … — OCR, image captioning, visual reasoning.
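
A sketch of that invocation (file names are placeholders; match whatever you actually downloaded from the HF card):

llama-server -m qwen3.6-27b-Q4_K_M.gguf \
    --mmproj mmproj-qwen3.6-27b-Q6_K.gguf \
    --port 8080

llama-server then exposes an OpenAI-compatible endpoint at http://localhost:8080/v1 that accepts image inputs alongside text.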

Benchmark it yourself

ollama run batiai/qwen3.6-27b:q4 --verbose "Write a haiku about Seoul in autumn."

--verbose prints prompt-eval rate, token-gen rate, and memory use.

Full BatiAI harness — single script, nothing else needed on the Mac:

curl -O https://raw.githubusercontent.com/batiai/batiai-models/main/test-qwen3.6-27b.sh
chmod +x test-qwen3.6-27b.sh
./test-qwen3.6-27b.sh                 # iq3 iq4 q3 q4 by default (~10 min)
./test-qwen3.6-27b.sh iq4             # one tag
./test-qwen3.6-27b.sh iq3 iq4 q4 q6   # pick your own set

Share reports/bench-qwen3.6-27b-*.json — we add your hardware row to the Hugging Face card.

About the “3.6” naming

Qwen released this publicly as 3.6, but the HF config uses the transitional class name Qwen3_5ForConditionalGeneration internally. llama.cpp converts via Qwen3_5TextModel — same code path as the 35B-A3B sibling.

Built for BatiFlow

flow.bati.ai — free, on-device AI automation for Mac. 5 MB app, 100 % local, unlimited.