Details

Digest: d1eccde3ec84
Architecture: qwen35
Parameters: 9.65B
Quantization: Q4_K_M
Model size: 6.6 GB
License: Apache License, Version 2.0
Params (truncated on the source page): { "frequency_penalty": 0, "min_p": 0, "num_ctx": 32768, "num_gpu": 99, "presence

Qwen3.5-9B — Ollama Model Configuration (16 GB)

DESCRIPTION

A coding-optimized configuration of Qwen3.5-9B designed for 16 GB single-GPU hardware. The model uses the official Q4_K_M quantization (~6.6 GB weights), leaving ~9 GB headroom for KV cache — enabling 32K+ context windows comfortably.

Model: oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

Why this model?

9B dense model: all parameters active per token (no MoE), so every response uses the full model
Only 6.6 GB at Q4_K_M: fits easily on any 16 GB GPU (RTX 4060 Ti 16GB, RTX 5060 Ti 16GB, RTX 4080, etc.)
Massive headroom: ~9 GB free for KV cache, enabling 32K+ context
Native multimodal: handles text, images, and video without separate vision models
Coding-oriented sampling: temperature 0.7, top_p 0.8, repeat_penalty 1.05 for focused code output (sampling remains stochastic, not deterministic)
Tool calling: works with OpenCode, Cline, Qwen Code, and other coding agents
Thinking mode: toggleable per request (Ollama exposes this as the think API parameter; enable_thinking is the name used by the upstream Qwen chat template)

Key Features

Dense Architecture: 9B parameters, all active — no quality loss from MoE routing
Fast Generation: ~40-80 tok/s on 16 GB GPU
Tool Calling: Native support for coding agents
Thinking Mode: Optional (toggle via API)
Long Context: 262K native (32K configured for VRAM budget)
Multimodal: Text + Image + Video (natively, from the same weights)

TARGET HARDWARE

GPU	Memory	Example Models
Single GPU	16 GB VRAM	RTX 4060 Ti 16GB, RTX 5060 Ti 16GB, RTX 4080, A4000, etc.
Apple Silicon	16 GB+ unified	M2/M3/M4 Pro or Max

Requires NVIDIA driver 525+ (550+ recommended).

MODEL SELECTION RATIONALE

Qwen3.5-9B is the sweet spot for 16 GB hardware:

Quantization	Weight Size	VRAM Needed	Context (q4_0)	Fits 16 GB?
Q4_K_M (default)	~6.6 GB	~8 GB	32K+ ✅	✅ Yes — recommended
Q8_0	~9-10 GB	~11 GB	16K-32K ✅	✅ Yes
F16 (BF16)	~18 GB	~20 GB	—	❌ No

The official qwen3.5:9b at Q4_K_M is the ideal choice — small enough to leave massive headroom, yet capable enough to beat models 3x its size on coding and reasoning benchmarks.
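
As a rough sanity check on the sizes in the table, weight footprint is approximately parameter count times average bits per weight. The sketch below assumes the 9.65B parameter count reported for this build and decimal gigabytes; the helper names are illustrative only.

```python
PARAMS = 9.65e9  # parameter count reported for this build (treated as exact here)

def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight size in decimal GB for an average bit width."""
    return params * bits_per_weight / 8 / 1e9

def effective_bits_per_weight(size_gb: float, params: float) -> float:
    """Average bits per weight implied by a file size in decimal GB."""
    return size_gb * 1e9 * 8 / params

f16_gb = weight_size_gb(PARAMS, 16)
q4_bits = effective_bits_per_weight(6.6, PARAMS)
print(f"F16 weights: ~{f16_gb:.1f} GB")
print(f"Q4_K_M effective width: ~{q4_bits:.2f} bits/weight")
```

The F16 estimate lands a little above the table's ~18 GB figure, which is consistent if the table is counting in GiB rather than decimal GB. The Q4_K_M width comes out above 4 bits because K-quants mix block scales and some higher-precision tensors into the file.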

MODELFILE

The custom Modelfile is at Qwen3.5-9B/Modelfile. Key settings:

Parameter	Value	Rationale
`FROM`	`qwen3.5:9b`	Official 6.6 GB Q4_K_M — fits 16 GB with headroom
`num_ctx`	32768	32K context with lots of headroom on 16 GB
`num_gpu`	99	Offload all 32 layers to GPU
`temperature`	0.7	Moderate value for focused, reliable code output (sampling is still stochastic)
`top_p`	0.8	Narrow sampling for focused code generation
`top_k`	20	Standard for coding tasks
`min_p`	0.0	No minimum probability filter
`presence_penalty`	0.0	No topic diversity forcing for code
`repeat_penalty`	1.05	Slight penalty to avoid repetition in code
`frequency_penalty`	0.0	No frequency penalty
`stop`	`<|im_start|>`, `<|im_end|>`	Chat template stop tokens
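
The Modelfile itself is not reproduced in this document, so the following is only a sketch of how the table could be rendered into Modelfile syntax (one `FROM` line, then one `PARAMETER` line per setting). The `render_modelfile` helper is hypothetical, and the author's actual file may differ in order or content.

```python
# Settings copied from the table above; the author's actual Modelfile may differ.
SETTINGS = {
    "num_ctx": 32768,
    "num_gpu": 99,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repeat_penalty": 1.05,
    "frequency_penalty": 0.0,
}
STOP_TOKENS = ["<|im_start|>", "<|im_end|>"]

def render_modelfile(base: str, settings: dict, stops: list[str]) -> str:
    """Render Modelfile text: a FROM line, then one PARAMETER line per setting."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {name} {value}" for name, value in settings.items()]
    # Each stop sequence gets its own quoted PARAMETER line.
    lines += [f'PARAMETER stop "{token}"' for token in stops]
    return "\n".join(lines) + "\n"

modelfile_text = render_modelfile("qwen3.5:9b", SETTINGS, STOP_TOKENS)
print(modelfile_text)
```

Writing `modelfile_text` to a file and passing that file to `ollama create -f` would be the natural next step.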

VRAM BUDGET ANALYSIS

Component	Size
`qwen3.5:9b` weights (Q4_K_M)	~6.6 GB
KV cache (q4_0, 32K context)	~3-4 GB
KV cache (q4_0, 64K context)	~6-8 GB
KV cache (f16, 32K context)	~6-8 GB
Ollama process overhead	~0.5-1 GB
Total (q4_0 + 32K)	~10-11 GB ✅ lots of headroom
Total (q4_0 + 64K)	~13-15 GB ⚠️ tight but fits
Total (f16 + 32K)	~13-15 GB ⚠️ tight but fits

Note: About 1 GB of baseline VRAM is consumed by the desktop environment (X11/Wayland, browser, etc.). Even accounting for this, q4_0 + 32K fits with ~4-5 GB to spare.
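
The totals in the table are simple sums, which can be reproduced with a short sketch. The ranges below are copied from the table; the 1 GB desktop baseline comes from the note above, and the variable names are illustrative.

```python
GPU_TOTAL_GB = 16.0
WEIGHTS_GB = 6.6            # Q4_K_M weights
OVERHEAD_GB = (0.5, 1.0)    # Ollama process overhead (low, high)
DESKTOP_GB = 1.0            # baseline used by the desktop environment

# KV cache ranges (low, high) in GB, taken from the table above
KV_CACHE_GB = {
    ("q4_0", 32768): (3.0, 4.0),
    ("q4_0", 65536): (6.0, 8.0),
    ("f16", 32768): (6.0, 8.0),
}

def budget(kv_type: str, context: int) -> tuple[float, float]:
    """Return (low, high) total VRAM use in GB, excluding the desktop baseline."""
    kv_low, kv_high = KV_CACHE_GB[(kv_type, context)]
    return (WEIGHTS_GB + kv_low + OVERHEAD_GB[0],
            WEIGHTS_GB + kv_high + OVERHEAD_GB[1])

for kv_type, context in KV_CACHE_GB:
    low, high = budget(kv_type, context)
    spare_low = GPU_TOTAL_GB - DESKTOP_GB - high
    spare_high = GPU_TOTAL_GB - DESKTOP_GB - low
    print(f"{kv_type} @ {context}: {low:.1f}-{high:.1f} GB used, "
          f"{spare_low:.1f} to {spare_high:.1f} GB spare")
```

Taking the upper end of every range is slightly more pessimistic than the rounded totals in the table, so treat the printed spare figures as a lower bound.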

COMPLETE SETUP GUIDE

1. Install Ollama

# Ubuntu/Debian
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

# Verify
ollama --version   # Should be 0.30.6 or later

2. Set Environment Variables (Recommended)

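# KV cache at q4_0: cuts KV cache memory substantially vs the default f16
# Only needed if you want maximum context on 16 GB
# Quantized KV cache requires Flash Attention; enable it explicitly
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q4_0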

# Single GPU — no special scheduler needed
# Limit to one parallel request (optional)
export OLLAMA_NUM_PARALLEL=1
export OLLAMA_MAX_LOADED_MODELS=1

Permanent setup (systemd):

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"

Then:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Permanent setup (macOS launchd):

launchctl setenv OLLAMA_FLASH_ATTENTION 1
launchctl setenv OLLAMA_KV_CACHE_TYPE q4_0
launchctl setenv OLLAMA_NUM_PARALLEL 1
launchctl setenv OLLAMA_MAX_LOADED_MODELS 1
# Quit and restart Ollama from Applications

3. Start the Ollama Server

# Start (if not running via systemd)
ollama serve

Verify it’s running:

curl -s http://127.0.0.1:11434/api/version
# → {"version":"0.30.6"}
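
The same check can be scripted. This is a minimal sketch that assumes the `/api/version` response shape shown above and the minimum version stated in this guide; `parse_version`, `fetch_version`, and `is_supported` are illustrative names, not part of any Ollama client.

```python
import json
import re
import urllib.request

MIN_VERSION = (0, 30, 6)  # minimum stated in this guide

def parse_version(body: str) -> tuple[int, ...]:
    """Extract the leading numeric x.y.z from an /api/version JSON body."""
    version = json.loads(body)["version"]
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())

def fetch_version(base_url: str = "http://127.0.0.1:11434") -> tuple[int, ...]:
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as resp:
        return parse_version(resp.read().decode("utf-8"))

def is_supported(version: tuple[int, ...]) -> bool:
    """Compare against MIN_VERSION using tuple ordering."""
    return version >= MIN_VERSION
```

With the server running, `is_supported(fetch_version())` returns whether the installed version meets the minimum. Any pre-release suffix after the patch number is ignored by the parser.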

4. Verify GPU Detection

# Check Ollama detects your GPU via CUDA
ollama ps  # (run after model is loaded)
# Or check server logs:
journalctl -u ollama --no-pager | grep "inference compute"

Expected output:

inference compute ... library=CUDA compute=8.9 name=CUDA0 description="NVIDIA GeForce RTX 4060 Ti"

If you see “library=Vulkan” instead, ensure:

- NVIDIA driver is properly installed (525+)
- Ollama was restarted after setting env vars
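
If you want to check the backend from a script, the key=value fields in that log line can be pulled apart with the standard library. This sketch assumes the line format shown in the expected output above; real log lines may carry extra fields or quoting that this simple parser does not handle.

```python
import shlex

def parse_inference_compute(line: str) -> dict[str, str]:
    """Collect key=value fields from an 'inference compute' log line."""
    fields = {}
    for token in shlex.split(line):  # shlex keeps quoted values together
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields

sample = ('inference compute ... library=CUDA compute=8.9 name=CUDA0 '
          'description="NVIDIA GeForce RTX 4060 Ti"')
info = parse_inference_compute(sample)
print(info["library"], "|", info["description"])
```

A value other than `CUDA` in `info["library"]` on NVIDIA hardware is the signal to revisit the driver checks listed above.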

5. Pull & Create (One-Time Setup)

# Pull the base model (~6.6 GB)
ollama pull qwen3.5:9b

# Create the custom model from the Modelfile
ollama create oamazonasgabriel/qwen3.5-9b:q4-16gbGPU \
  -f Qwen/Qwen3.5-9B/Modelfile

6. Free GPU Memory

⚠️ Other GPU-using applications (LM Studio, training scripts, browsers) consume VRAM. Check with nvidia-smi before loading:

nvidia-smi

If VRAM is occupied, quit competing processes:

pkill -f lm-studio  # does nothing if no matching process is running

7. Test the Model

Via CLI:

ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

Via REST API:

curl -s http://127.0.0.1:11434/api/generate \
  -d '{
    "model": "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU",
    "prompt": "Write a function to merge two sorted arrays in Python",
    "stream": false
  }'

Via OpenAI-compatible endpoint:

curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU",
    "messages": [{"role": "user", "content": "Write a React component"}],
    "stream": false
  }'

USAGE EXAMPLES

CLI

# Interactive chat (coding mode)
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

# Single code prompt
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU \
  "Write a Python function that implements binary search"

# With thinking mode enabled (the --think flag is available in recent Ollama releases)
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU --think \
  "Write a Python function that implements binary search"

Python

import ollama

# Chat with code generation
response = ollama.chat(
    model='oamazonasgabriel/qwen3.5-9b:q4-16gbGPU',
    messages=[{'role': 'user', 'content': 'Write a FastAPI endpoint'}],
)
print(response.message.content)

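# With thinking mode (think is a top-level argument, not an entry in options)
response = ollama.chat(
    model='oamazonasgabriel/qwen3.5-9b:q4-16gbGPU',
    messages=[{'role': 'user', 'content': 'Debug this code: ...'}],
    think=True,
)
print(response.message.thinking)  # reasoning trace
print(response.message.content)   # final answer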

OpenCode Integration

Add to ~/.config/opencode/opencode.jsonc:

"oamazonasgabriel/qwen3.5-9b:q4-16gbGPU": {
  "name": "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU",
  "options": {
    "supportsThinking": true,
    "contextWindow": 32768
  }
}

Use as agent:

"model": "ollama/oamazonasgabriel/qwen3.5-9b:q4-16gbGPU"

CONTEXT WINDOW SCALING

With the 16 GB VRAM budget and q4_0 KV cache:

Context	KV Cache	Fits 16 GB?	Notes
8,192	~0.7 GB	✅ Lots of headroom	~8 GB total
16,384	~1.5 GB	✅ Plenty of headroom	~9 GB total
32,768	~3-4 GB	✅ Recommended	~10-11 GB total
65,536	~6-8 GB	⚠️ Tight	~13-15 GB total — monitor with `nvidia-smi`
131,072	~12-16 GB	❌ No	CPU offload required
262,144 (native)	~24+ GB	❌ No	Needs 32 GB+ VRAM
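
The table is close to linear in context length, so intermediate sizes can be estimated by scaling the 32K row. The sketch below does that under the assumption that KV cache grows proportionally with context; the table's own small-context figures are slightly lower than a strict linear fit, so read the output as an approximation.

```python
GPU_TOTAL_GB = 16.0
WEIGHTS_GB = 6.6
OVERHEAD_GB = 1.0            # upper end of the process overhead range
KV_GB_PER_32K = (3.0, 4.0)   # q4_0 range at 32,768 tokens, from the table

def kv_cache_gb(context: int) -> tuple[float, float]:
    """Linear (low, high) KV cache estimate in GB for a context length."""
    scale = context / 32768
    return KV_GB_PER_32K[0] * scale, KV_GB_PER_32K[1] * scale

def fits(context: int, reserve_gb: float = 1.0) -> bool:
    """True if the pessimistic estimate fits, keeping reserve_gb for the desktop."""
    _, kv_high = kv_cache_gb(context)
    return WEIGHTS_GB + OVERHEAD_GB + kv_high <= GPU_TOTAL_GB - reserve_gb

for context in (8192, 16384, 32768, 65536, 131072):
    low, high = kv_cache_gb(context)
    print(f"{context:>7}: KV ~{low:.1f}-{high:.1f} GB, fits={fits(context)}")
```

Because `fits` uses the high end of every range, it reports 65,536 as not fitting, which matches the "tight" warning in the table: that size only works when actual usage sits near the low end.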

PERFORMANCE

On a 16 GB single-GPU setup:

- Prompt processing: ~40-60 tok/s (depends on prompt length)
- Text generation: ~40-80 tok/s (dense model, all 9B active)
- Model load time: ~10-15 seconds (6.6 GB load)
- First token latency: low, since the model loads fast due to its small size

The dense architecture means every token uses the full 9B parameters, giving you maximum quality per response — no routing decisions or expert sparsity.

MONITORING

# Check model is fully GPU-resident
ollama ps
# If CPU% appears, reduce context or switch KV cache type

# Real-time GPU usage
watch -n 1 nvidia-smi

# GPU memory query
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

# Ollama server logs (Linux)
journalctl -u ollama -n 50 --no-pager
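
For scripted monitoring, the CSV produced by the `nvidia-smi --query-gpu` command above can be parsed into numbers. This is a minimal sketch that assumes the usual header row and `MiB`-suffixed values; the sample string is hypothetical, not captured from a real machine.

```python
import csv
import io

def parse_gpu_memory(csv_text: str) -> list[dict[str, int]]:
    """Parse 'index, memory.used [MiB], memory.total [MiB]' rows into MiB integers."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader)  # skip the header row
    gpus = []
    for row in reader:
        if not row:
            continue
        index, used, total = (cell.strip() for cell in row)
        used_mib = int(used.split()[0])    # "1024 MiB" -> 1024
        total_mib = int(total.split()[0])
        gpus.append({"index": int(index), "used_mib": used_mib,
                     "total_mib": total_mib, "free_mib": total_mib - used_mib})
    return gpus

# Hypothetical sample output, for illustration only
sample = "index, memory.used [MiB], memory.total [MiB]\n0, 1024 MiB, 16380 MiB\n"
gpus = parse_gpu_memory(sample)
print(gpus)
```

To use it on live data, pass in the stdout of the `nvidia-smi` query, for example via `subprocess.run(..., capture_output=True, text=True).stdout`.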

TROUBLESHOOTING

Model fails to load — “unable to allocate buffer”

Cause: Not enough free VRAM. Another process (LM Studio, browser, Xorg) is consuming GPU memory.

Fix: Kill competing GPU processes. Free at least 12 GB before loading.

OOM / Model does not load

This is unlikely on 16 GB with Q4_K_M, but if it happens:

# Try with reduced context: start the session, then lower num_ctx at the prompt
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
>>> /set parameter num_ctx 16384
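
The context size can also be lowered for a single API call through the request's `options` field, which leaves the Modelfile default untouched. The sketch below uses only the standard library against the `/api/generate` endpoint shown earlier; the function names are illustrative.

```python
import json
import urllib.request

MODEL = "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU"

def build_generate_request(prompt: str, num_ctx: int,
                           base_url: str = "http://127.0.0.1:11434"):
    """Build a POST to /api/generate that overrides num_ctx for this call only."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request override of the Modelfile value
    }
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt: str, num_ctx: int = 16384) -> str:
    """Send the request to a running server and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Write a binary search in Python")` requires a running server with the model available; if 16,384 still fails to load, halving `num_ctx` again is a reasonable next step.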

CPU offloading detected

ollama ps shows CPU usage for the model. Reduce num_ctx or ensure OLLAMA_KV_CACHE_TYPE=q4_0 is set.
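
To automate this check, the PROCESSOR column of `ollama ps` can be interpreted with a small helper. This sketch assumes the column takes the forms `100% GPU`, `100% CPU`, or a split such as `48%/52% CPU/GPU`; other Ollama versions may format it differently, in which case the function raises rather than guessing.

```python
import re

def gpu_share(processor: str) -> int:
    """Return the GPU percentage from a PROCESSOR cell of `ollama ps`."""
    text = processor.strip()
    single = re.fullmatch(r"(\d+)%\s+(GPU|CPU)", text)
    if single:
        percent, device = int(single.group(1)), single.group(2)
        return percent if device == "GPU" else 100 - percent
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", text)
    if split:
        return int(split.group(2))
    raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")

def is_fully_on_gpu(processor: str) -> bool:
    """True only when the whole model is resident on the GPU."""
    return gpu_share(processor) == 100
```

Anything below 100 means some layers are running on the CPU, which is when reducing `num_ctx` is worth trying.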

Thinking mode not working

Thinking mode is off by default in this configuration. Enable it via API:

ollama.chat(..., think=True)

Or in CLI: pass --think to ollama run, or type /set think inside an interactive session.

Slow generation

On 16 GB hardware, you should get 40-80 tok/s. If significantly slower:

- Check for CPU offloading (ollama ps)
- Ensure GPU is detected via CUDA, not Vulkan
- Check another process isn’t consuming GPU compute

Tool calling fails

Ensure you’re using the correct prompt format. This model uses the Qwen chat template with <|im_start|> and <|im_end|> tokens.
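
When calling the model through Ollama's `/api/chat` endpoint, the server applies the chat template for you, so tools are passed as structured JSON rather than hand-written template tokens. The sketch below shows one plausible request body and a helper for reading tool calls from a response; the `read_file` tool is hypothetical and exists only for illustration.

```python
import json

MODEL = "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU"

# Hypothetical tool, defined only for illustration
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Return the contents of a text file",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "File path"}},
            "required": ["path"],
        },
    },
}

def build_tool_chat_payload(user_message: str) -> dict:
    """Body for POST /api/chat with one tool attached."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [READ_FILE_TOOL],
        "stream": False,
    }

def extract_tool_calls(response_body: dict) -> list[dict]:
    """Return tool calls from a /api/chat response, or [] if the model answered in text."""
    return response_body.get("message", {}).get("tool_calls", [])

payload = build_tool_chat_payload("Show me setup.py")
print(json.dumps(payload, indent=2))
```

If `extract_tool_calls` keeps returning an empty list for prompts that clearly need the tool, check that the client is sending the `tools` field at all before suspecting the template.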

PUSH TO REGISTRY

ollama push oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

QUICK REFERENCE

# Full setup from scratch
ollama pull qwen3.5:9b
ollama create oamazonasgabriel/qwen3.5-9b:q4-16gbGPU \
  -f Qwen/Qwen3.5-9B/Modelfile
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

# Or just pull from registry (once pushed)
ollama pull oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

ALTERNATIVE MODELS

If you need more capability or have more VRAM:

# Qwen3.6-35B-A3B MoE (needs 24 GB)
ollama pull oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

# Higher quality Qwen3.5-9B quant
ollama pull qwen3.5:9b  # Default Q4_K_M
ollama pull hf.co/bartowski/Qwen3.5-9B-Instruct-GGUF:Q8_0  # Near-lossless, ~9 GB

LINKS

Resource	URL
This model on Ollama	https://ollama.com/oamazonasgabriel/qwen3.5-9b
Upstream model (Ollama)	https://ollama.com/library/qwen3.5:9b
Upstream model (HuggingFace)	https://huggingface.co/Qwen/Qwen3.5-9B
Ollama documentation	https://docs.ollama.com
Contributor's GitHub	https://github.com/oamazonasgabriel
Built by	impacte.tech