Qwen 3.6 27B Dense — Quantized by BatiAI

“Flagship Coding in a 27B Dense Package.” These are imatrix-calibrated GGUF quantizations of the official Qwen/Qwen3.6-27B (Dense, Apache 2.0), released 2026-04-22 by Alibaba. Free, unlimited, on-device AI for Mac via BatiFlow.

Multimodal-capable (vision) via separate mmproj on Hugging Face. Ollama ships text-only.

Available tags

| Tag  | Size  | Min RAM | Use case |
|------|-------|---------|----------|
| :iq3 | 11 GB | 24 GB   | Smallest footprint |
| :q3  | 13 GB | 24 GB   | K-quant alt for iq3 |
| :iq4 | 15 GB | 24 GB   | ⚠ currently slow on Apple Metal — see note |
| :q4  | 16 GB | 24 GB   | Recommended on Mac (best speed/quality) |
| :q6  | 21 GB | 32 GB+  | Near-BF16 quality |

All five tags are imatrix-calibrated (wikitext-2-raw) and support tools + thinking. Qwen 3.6 thinks by default — pass "think": false in /api/chat to skip the <think> block for low-latency tool calls. The legacy Qwen 3.5 /no_think prompt prefix does NOT work on 3.6.
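
For reference, a minimal request with thinking disabled might look like this (a sketch against a local Ollama on the default port; swap in whichever tag you pulled):

curl http://localhost:11434/api/chat -d '{
  "model": "batiai/qwen3.6-27b:q4",
  "messages": [{ "role": "user", "content": "Ping?" }],
  "think": false,
  "stream": false
}'

With "think": false the response contains no <think> block, which is what you want for low-latency tool calls.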

Quick Start

ollama pull batiai/qwen3.6-27b:iq4
ollama run batiai/qwen3.6-27b:iq4

Dense vs MoE — which to pull?

|              | Qwen 3.6 27B (this model) | Qwen 3.6 35B-A3B |
|--------------|---------------------------|------------------|
| Architecture | Dense 27B | MoE, 3B active / 35B total |
| Typical M4 Max gen | 16-18 t/s | ~45-50 t/s |
| Strength | single-pass dense-reasoning quality, long-horizon agents | interactive chat, streaming, lower RAM |
| Best for | batch tool-use, code-review loops, offline generation | default BatiFlow chat, live RAG |

Both Apache 2.0, both with tools + thinking + 262 K context. Pull 27B when per-token latency matters less than maximum dense-model quality.

Why Qwen 3.6 27B Dense?

Upstream positions it as “flagship coding in a dense package”. Alibaba reports the 27B dense matches or beats the previous-generation 397B-A17B MoE on major agentic-coding benchmarks — 14× smaller total footprint for equivalent reasoning on long-horizon coding tasks.

  • SWE-bench / Terminal-Bench / QwenWebBench — flagship-tier among dense open models (see full numbers on the HF card)
  • 262 K native context (1 M with YaRN) — whole-repo reasoning
  • Thinking mode (default ON) — step-by-step reasoning before answer
  • Function calling via qwen3_coder parser — works with BatiFlow Tools
  • Multimodal via separate mmproj on HF (Ollama text-only)
  • Apache 2.0 — commercial-friendly
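
Note that Ollama serves a smaller default context than the model's maximum, so to use the long window you raise num_ctx per request (or in a Modelfile). A minimal sketch — and keep in mind the full 262144-token KV cache needs far more RAM than the table below assumes:

curl http://localhost:11434/api/generate -d '{
  "model": "batiai/qwen3.6-27b:q4",
  "prompt": "Summarize the architecture of this repository: ...",
  "options": { "num_ctx": 262144 },
  "stream": false
}'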

RAM guide

| Your Mac | :iq3 11G | :q3 13G | :iq4 15G | :q4 16G | :q6 21G |
|----------|----------|---------|----------|---------|---------|
| 16 GB    | ❌ swap-bound (0.02 t/s measured) | ❌ | ❌ | ❌ | ❌ |
| 24 GB    | ✅ | ✅ | ✅ (slow — see Metal note) | ✅ | ❌ |
| 32 GB    | ✅ | ✅ | ✅ | ✅ | ✅ tight |
| 48 GB+   | ✅ | ✅ | ✅ | ✅ | ✅ comfortable |

16 GB Mac: this model is not for you. Dense 27B + KV cache + macOS exceeds 16 GB unified memory; measured at 0.02 t/s (~30 min for a short greeting). Use smaller BatiAI models on 16 GB Macs — Qwen 3.5 9B, Gemma 4 E4B-it, etc.

Measured performance

Apple Silicon (measured via ollama run --verbose, thinking on)

| Hardware | Quant | Gen (warm) | Prompt eval | Cold load | Ollama RAM |
|----------|-------|------------|-------------|-----------|------------|
| M4 Max 128 GB | IQ3_XXS | 17.83 t/s | 108.7 t/s | 5.0 s | 24 GB |
| M4 Max 128 GB | Q3_K_M | 15.30 t/s | 111.7 t/s | 6.6 s | 26 GB |
| M4 Max 128 GB | IQ4_XS | ⚠ 5.52 t/s | 82.5 t/s | 8.0 s | 28 GB |
| M4 Max 128 GB | Q4_K_M | 16.56 t/s | 114.5 t/s | 8.3 s | 29 GB |
| Mac mini M4 16 GB | IQ3_XXS | 0.02 t/s ❌ | 0.6 t/s | 16 s | swap-bound |

All 5 quants produce valid tool-call JSON when "think": false is passed in /api/chat (or when using the updated test-qwen3.6-27b.sh script, which sets it). Real BatiFlow flows always pass think: false for tool calls, so this is the correct usage pattern.
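
For reference, a tool-call request combining tools with "think": false might look like the sketch below (the get_weather schema is a made-up illustration, not one of BatiFlow's actual tool functions):

curl http://localhost:11434/api/chat -d '{
  "model": "batiai/qwen3.6-27b:q4",
  "messages": [{ "role": "user", "content": "Weather in Seoul?" }],
  "think": false,
  "stream": false,
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }]
}'

If the call succeeds, the reply carries structured arguments in message.tool_calls rather than prose JSON.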

⚠ IQ4_XS is currently slow on Apple M-series — upstream regression

IQ4_XS at 5.52 t/s vs Q4_K_M at 16.56 t/s on the same M4 Max is a known upstream llama.cpp / Metal kernel regression, documented in llama.cpp issue #21655 (~3.8× slowdown from tag b8680 to current). The same quant runs at expected speed on older builds and on NVIDIA GPUs (within 10 % of Q4_K_M). When the fix lands upstream and ships in an Ollama update, the existing :iq4 download will speed up without re-pulling.

Until then on Apple Silicon: pull :q4 (Q4_K_M), not :iq4.

Recommended tag by Mac size

| Your Mac | Pull |
|----------|------|
| 16 GB | ❌ not this model — too small (use qwen3.5-9b or gemma4-e4b) |
| 24 GB | batiai/qwen3.6-27b:iq3 or :q3 |
| 32 GB | batiai/qwen3.6-27b:q4 ← best speed/quality combo |
| 48 GB+ | batiai/qwen3.6-27b:q4 (interactive) or :q6 (max quality) |
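
After pulling, a quick sanity check with stock Ollama commands (sizes should line up with the tag table above):

ollama pull batiai/qwen3.6-27b:q4
ollama list | grep qwen3.6-27b       # expect a ~16 GB entry for :q4
ollama show batiai/qwen3.6-27b:q4    # quant, context length, template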

Server reference — BatiAI build rig (2× RTX 6000 Ada 48 GB = 96 GB total)

Measured with llama-cli --reasoning off, build bafae2765, thinking OFF:

Single GPU (models fit in one 48 GB card — fastest configuration):

| Quant | Gen t/s | Load | VRAM (4 K ctx) |
|-------|---------|------|----------------|
| IQ3_XXS | 97.4 | 5 s | ~12 GB |
| Q3_K_M | 88.2 | 8 s | ~15 GB |
| IQ4_XS | 85.7 | 9 s | ~16 GB |
| Q4_K_M | 79.0 | 10 s | ~18 GB |
| Q6_K | 64.1 | 13 s | ~23 GB |

Dual-GPU tensor-split (Q6_K reference): 35.6 t/s — 45 % slower than single-GPU because splitting a 23 GB model that already fits in one 48 GB card adds tensor-parallel communication overhead with zero memory benefit. Tensor-split is for models too large for one card (e.g. Qwen 3.6-35B-A3B long-context or 1 T+ MoE), not for speedup on 27 B. Use CUDA_VISIBLE_DEVICES=1 for inference on this lineup.
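
Pinning inference to the second card might look like this (a sketch; the GGUF path is a placeholder, and -ngl 99 simply offloads all layers to the GPU):

CUDA_VISIBLE_DEVICES=1 llama-cli -m qwen3.6-27b-Q6_K.gguf -ngl 99 \
    -p "Explain tensor-parallel overhead in one paragraph."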

Mac reaches ~20 % of single-GPU server throughput — expected for memory-bandwidth-bound dense 27 B.

Why BatiAI?

  • Quantized directly from official Qwen BF16 weights — no re-quantization of someone else’s GGUF
  • IQ + K-quant variants share the same wikitext-2-raw imatrix recipe as every BatiAI model
  • Verified on real Apple Silicon, tool-calling validated for BatiFlow’s 57 tool functions

Why text-only on Ollama?

Upstream Qwen 3.6-27B is multimodal. GGUF splits it into two files (main model + mmproj.gguf), and Ollama’s mmproj integration is still rough. On Ollama we ship the text tower only — a single file, one ollama pull, covering every BatiFlow use case (chat, code, tools, RAG).

Need images? Download the mmproj-*-Q6_K.gguf separately from Hugging Face and run via llama-server --mmproj … — OCR, image captioning, visual reasoning.
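
A sketch of that invocation (file names are placeholders; match whatever you actually downloaded from the HF card):

llama-server -m qwen3.6-27b-Q4_K_M.gguf \
    --mmproj mmproj-qwen3.6-27b-Q6_K.gguf \
    --port 8080

llama-server then exposes an OpenAI-compatible endpoint at http://localhost:8080/v1 that accepts image inputs alongside text.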

Benchmark it yourself

ollama run batiai/qwen3.6-27b:q4 --verbose "Write a haiku about Seoul in autumn."

--verbose prints prompt-eval rate, token-gen rate, and memory use.

Full BatiAI harness — single script, nothing else needed on the Mac:

curl -O https://raw.githubusercontent.com/batiai/batiai-models/main/test-qwen3.6-27b.sh
chmod +x test-qwen3.6-27b.sh
./test-qwen3.6-27b.sh                 # iq3 iq4 q3 q4 by default (~10 min)
./test-qwen3.6-27b.sh iq4             # one tag
./test-qwen3.6-27b.sh iq3 iq4 q4 q6   # pick your own set

Share reports/bench-qwen3.6-27b-*.json — we add your hardware row to the Hugging Face card.

About the “3.6” naming

Qwen released this publicly as 3.6, but the HF config uses the transitional class name Qwen3_5ForConditionalGeneration internally. llama.cpp converts via Qwen3_5TextModel — same code path as the 35B-A3B sibling.

Built for BatiFlow

flow.bati.ai — free, on-device AI automation for Mac. 5 MB app, 100 % local, unlimited.