130 3 days ago

Qwen 3.6 27B dense, multimodal (text + image + video), thinking + native tool calling, 262K native context.

vision tools thinking
ollama run odytrice/qwen3.6-27b:5090

Applications

Claude Code
Claude Code ollama launch claude --model odytrice/qwen3.6-27b:5090
Codex App
Codex App ollama launch codex-app --model odytrice/qwen3.6-27b:5090
OpenClaw
OpenClaw ollama launch openclaw --model odytrice/qwen3.6-27b:5090
Hermes Agent
Hermes Agent ollama launch hermes --model odytrice/qwen3.6-27b:5090
Codex
Codex ollama launch codex --model odytrice/qwen3.6-27b:5090
OpenCode
OpenCode ollama launch opencode --model odytrice/qwen3.6-27b:5090

Models

View all →

1 model

qwen3.6-27b:5090

17GB · 256K context window · Text, Image · 3 days ago

Readme

Qwen 3.6 27B

Qwen 3.6 27B dense, multimodal (text + image + video), thinking + native tool calling, 262K native context.

Model card for odytrice/qwen3.6-27b:5090. The dense 27B at Q4_K_M (~17 GB) does not leave usable KV cache headroom on a 24 GB 4090 at any practical context length, so only a 5090 profile is provided.

Upstream

Field Value
Upstream Qwen/Qwen3.6-27B
NVFP4 source unsloth/Qwen3.6-27B-NVFP4
FP8 source Qwen/Qwen3.6-27B-FP8
Family Qwen 3.6 (Alibaba)
Architecture Dense
Params ~27-28B
Modalities Text + Image + Video (vision)
Languages 100+
Tool calling Native (qwen3_coder parser in vLLM/SGLang)
Thinking mode Default on; togglable via enable_thinking
Native context 262,144 (extensible to 1,010,000 via YaRN)
License Apache 2.0

Tags

Tag GPU Quantization KV cache num_ctx
odytrice/qwen3.6-27b:5090 RTX 5090 (32 GB Blackwell) Q4_K_M (~17 GB), NVFP4 future q8_0 190000

Why this context size

190000 matches the gateway config exactly. 32 GB comfortably fits the weights plus q8_0 KV cache for 190K context. Below the model’s 262K native window - no YaRN scaling required.

Environment

Always set these before running Ollama:

set OLLAMA_KV_CACHE_TYPE=q4_0    # Windows
set OLLAMA_FLASH_ATTENTION=1

export OLLAMA_KV_CACHE_TYPE=q4_0   # Linux/macOS
export OLLAMA_FLASH_ATTENTION=1

Sampling

Per the Qwen team’s published guidance:

# Thinking mode - general tasks (default)
temperature        1.0
top_p              0.95
top_k              20
min_p              0.0
presence_penalty   1.5
repetition_penalty 1.0

# Thinking mode - precise coding (e.g. WebDev)
temperature        0.6
top_p              0.95
top_k              20
presence_penalty   0.0

# Instruct (non-thinking) mode
temperature        0.7
top_p              0.80
top_k              20
presence_penalty   1.5

To disable thinking: pass enable_thinking=False via chat_template_kwargs (vLLM/SGLang).

Strengths

  • Strong reasoning + coding balance in the dense 27B band
  • Native vision: text + image + video input
  • Native tool calling (qwen3_coder parser)
  • Thinking / non-thinking mode toggle
  • 262K native context (gated only by VRAM here)
  • 100+ languages
  • Apache 2.0 licensed

Caveats

  • NVFP4 weights exist upstream but Ollama does not yet load them

See also