7 7 hours ago

A coding-optimized configuration of Qwen3.5-9B designed for 16 GB single-GPU hardware. The model uses the official Q4_K_M quantization (~6.6 GB weights), leaving ~9 GB headroom for KV cache — enabling 32K+ context windows comfortably.

vision tools thinking
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

Details

7 hours ago

d1eccde3ec84 · 6.6GB · qwen35 · 9.65B · Q4_K_M
Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR US
{ "frequency_penalty": 0, "min_p": 0, "num_ctx": 32768, "num_gpu": 99, "presence

Readme

Qwen3.5-9B — Ollama Model Configuration (16 GB)

DESCRIPTION

A coding-optimized configuration of Qwen3.5-9B designed for 16 GB single-GPU hardware. The model uses the official Q4_K_M quantization (~6.6 GB weights), leaving ~9 GB headroom for KV cache — enabling 32K+ context windows comfortably.

Model: oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

Why this model?

  • 9B dense model: all parameters are active per token (no MoE), so every response uses the full model
  • Only 6.6 GB at Q4_K_M: fits easily on a 16 GB GPU (RTX 4060 Ti 16GB, RTX 5060 Ti 16GB, RTX 4080, etc.)
  • Massive headroom: ~9 GB free for KV cache, enabling 32K+ context
  • Native multimodal: handles text, images, and video without separate vision models
  • Coding-oriented sampling: temperature 0.7, top_p 0.8, repeat_penalty 1.05 for focused, consistent code (output is still sampled, not strictly deterministic)
  • Tool calling: works with OpenCode, Cline, Qwen Code, and other coding agents
  • Thinking mode: toggleable via the think API parameter

Key Features

  • Dense Architecture: 9B parameters, all active — no quality loss from MoE routing
  • Fast Generation: ~40-80 tok/s on 16 GB GPU
  • Tool Calling: Native support for coding agents
  • Thinking Mode: Optional (toggle via API)
  • Long Context: 262K native (32K configured for VRAM budget)
  • Multimodal: Text + Image + Video (natively, from the same weights)

TARGET HARDWARE

| GPU | Memory | Example Models |
| --- | --- | --- |
| Single GPU | 16 GB VRAM | RTX 4060 Ti 16GB, RTX 5060 Ti 16GB, RTX 4080, A4000, etc. |
| Apple Silicon | 16 GB+ unified | M2/M3/M4 Pro or Max |

Requires NVIDIA driver 525+ (550+ recommended).


MODEL SELECTION RATIONALE

Qwen3.5-9B is the sweet spot for 16 GB hardware:

| Quantization | Weight Size | VRAM Needed | Context (q4_0) | Fits 16 GB? |
| --- | --- | --- | --- | --- |
| Q4_K_M (default) | ~6.6 GB | ~8 GB | 32K+ ✅ | ✅ Yes (recommended) |
| Q8_0 | ~9-10 GB | ~11 GB | 16K-32K ✅ | ✅ Yes |
| F16 (BF16) | ~18 GB | ~20 GB | | ❌ No |

The official qwen3.5:9b at Q4_K_M is the ideal choice — small enough to leave massive headroom, yet capable enough to beat models 3x its size on coding and reasoning benchmarks.


MODELFILE

The custom Modelfile is at Qwen/Qwen3.5-9B/Modelfile (the path used by the ollama create commands in this guide). Key settings:

| Parameter | Value | Rationale |
| --- | --- | --- |
| FROM | qwen3.5:9b | Official 6.6 GB Q4_K_M; fits 16 GB with headroom |
| num_ctx | 32768 | 32K context with lots of headroom on 16 GB |
| num_gpu | 99 | Offload all 32 layers to GPU |
| temperature | 0.7 | Moderate value for focused, reliable code output (still sampled, not deterministic) |
| top_p | 0.8 | Narrow sampling for focused code generation |
| top_k | 20 | Standard for coding tasks |
| min_p | 0.0 | No minimum probability filter |
| presence_penalty | 0.0 | No topic diversity forcing for code |
| repeat_penalty | 1.05 | Slight penalty to avoid repetition in code |
| frequency_penalty | 0.0 | No frequency penalty |
| stop | <\|im_start\|>, <\|im_end\|> | Chat template stop tokens |
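
For reference, the settings in the table map onto a Modelfile built from the standard FROM and PARAMETER directives. The sketch below is a reconstruction from the table, not a copy of the actual file; the helper name render_modelfile is hypothetical and not part of Ollama.

```python
# Settings copied from the parameter table; the helper itself is illustrative.
BASE_MODEL = "qwen3.5:9b"
PARAMETERS = {
    "num_ctx": 32768,
    "num_gpu": 99,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repeat_penalty": 1.05,
    "frequency_penalty": 0.0,
}
STOP_TOKENS = ["<|im_start|>", "<|im_end|>"]


def render_modelfile(base=BASE_MODEL, parameters=PARAMETERS, stops=STOP_TOKENS):
    """Return Modelfile text with one PARAMETER line per setting."""
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    # Stop sequences are written as repeated PARAMETER lines, one per token.
    lines += [f'PARAMETER stop "{token}"' for token in stops]
    return "\n".join(lines) + "\n"


print(render_modelfile())
```

Writing the printed text to a file and passing it to ollama create -f should produce an equivalent model, assuming the base model has already been pulled.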

VRAM BUDGET ANALYSIS

| Component | Size |
| --- | --- |
| qwen3.5:9b weights (Q4_K_M) | ~6.6 GB |
| KV cache (q4_0, 32K context) | ~3-4 GB |
| KV cache (q4_0, 64K context) | ~6-8 GB |
| KV cache (f16, 32K context) | ~6-8 GB |
| Ollama process overhead | ~0.5-1 GB |
| Total (q4_0 + 32K) | ~10-11 GB ✅ lots of headroom |
| Total (q4_0 + 64K) | ~13-15 GB ⚠️ tight but fits |
| Total (f16 + 32K) | ~13-15 GB ⚠️ tight but fits |

Note: About 1 GB of baseline VRAM is consumed by the desktop environment (X11/Wayland, browser, etc.). Even accounting for this, q4_0 + 32K fits with ~4-5 GB to spare.
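
The totals above are plain sums of the component estimates. As a rough check, the sketch below adds the low and high ends of each range and subtracts the ~1 GB desktop baseline from 16 GB. The numbers are the estimates from the table, not measurements, and the function names are illustrative.

```python
# Figures (in GB) are the low/high estimates from the budget table.
WEIGHTS = (6.6, 6.6)
OVERHEAD = (0.5, 1.0)
KV_CACHE = {
    ("q4_0", 32768): (3.0, 4.0),
    ("q4_0", 65536): (6.0, 8.0),
    ("f16", 32768): (6.0, 8.0),
}
DESKTOP_BASELINE = 1.0  # VRAM used by the desktop environment, per the note
TOTAL_VRAM = 16.0


def budget(kv_type, context):
    """Return (low, high) total VRAM in GB for one KV cache configuration."""
    kv_low, kv_high = KV_CACHE[(kv_type, context)]
    low = WEIGHTS[0] + kv_low + OVERHEAD[0]
    high = WEIGHTS[1] + kv_high + OVERHEAD[1]
    return round(low, 1), round(high, 1)


def spare(kv_type, context):
    """Return (worst, best) free VRAM in GB after the desktop baseline."""
    low, high = budget(kv_type, context)
    usable = TOTAL_VRAM - DESKTOP_BASELINE
    return round(usable - high, 1), round(usable - low, 1)


for key in KV_CACHE:
    print(key, "total:", budget(*key), "spare:", spare(*key))
```

For q4_0 at 32K the raw sum comes to 10.1 to 11.6 GB, which the table rounds to ~10-11 GB, leaving about 3.4 to 4.9 GB spare once the desktop baseline is counted.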


COMPLETE SETUP GUIDE

1. Install Ollama

# Ubuntu/Debian
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

# Verify
ollama --version   # Should be 0.30.6 or later

2. Set Environment Variables (Recommended)

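# KV cache at q4_0: much smaller than the default f16 (Ollama's FAQ puts it at roughly 1/4)
# Quantized KV cache needs Flash Attention; set OLLAMA_FLASH_ATTENTION=1 if your version does not enable it by default
# Only needed if you want maximum context on 16 GB
export OLLAMA_KV_CACHE_TYPE=q4_0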

# Single GPU — no special scheduler needed
# Limit to one parallel request (optional)
export OLLAMA_NUM_PARALLEL=1
export OLLAMA_MAX_LOADED_MODELS=1

Permanent setup (systemd):

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"

Then:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Permanent setup (macOS launchd):

launchctl setenv OLLAMA_KV_CACHE_TYPE q4_0
launchctl setenv OLLAMA_NUM_PARALLEL 1
launchctl setenv OLLAMA_MAX_LOADED_MODELS 1
# Quit and restart Ollama from Applications

3. Start the Ollama Server

# Start (if not running via systemd)
ollama serve

Verify it’s running:

curl -s http://127.0.0.1:11434/api/version
# → {"version":"0.30.6"}
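
The same check can be scripted. The sketch below queries /api/version and compares the result numerically against the 0.30.6 minimum from step 1; the helper names are hypothetical, and the network call only runs when server_version() is invoked against a live server.

```python
import json
import urllib.request

MIN_VERSION = "0.30.6"  # minimum version stated in step 1


def parse_version(text):
    """Turn '0.30.6' into (0, 30, 6); ignores any '-rc' style suffix."""
    core = text.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def is_supported(version, minimum=MIN_VERSION):
    """Compare numerically, so '0.30.10' counts as newer than '0.30.6'."""
    return parse_version(version) >= parse_version(minimum)


def server_version(base_url="http://127.0.0.1:11434"):
    """Query the running server; raises URLError if it is not reachable."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as resp:
        return json.load(resp)["version"]
```

With the server running, is_supported(server_version()) returns True when the installed version meets the minimum.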

4. Verify GPU Detection

# Check Ollama detects your GPU via CUDA
ollama ps  # (run after model is loaded)
# Or check server logs:
journalctl -u ollama --no-pager | grep "inference compute"

Expected output:

inference compute ... library=CUDA compute=8.9 name=CUDA0 description="NVIDIA GeForce RTX 4060 Ti"

If you see “library=Vulkan” instead, ensure:

  • NVIDIA driver is properly installed (525+)
  • Ollama was restarted after setting env vars
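
To automate the log check, the key=value fields on the "inference compute" line can be parsed directly. This is a minimal sketch that assumes the line keeps the key=value layout shown in the expected output; the function names are illustrative.

```python
import re

# Matches key=value pairs where the value is either quoted or a bare word.
PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_compute_line(line):
    """Extract key=value fields from an 'inference compute' log line."""
    fields = {}
    for key, quoted, bare in PAIR.findall(line):
        fields[key] = quoted if quoted else bare
    return fields


def uses_cuda(line):
    """True when the backend library reported on the line is CUDA."""
    return parse_compute_line(line).get("library", "").upper() == "CUDA"


sample = ('inference compute library=CUDA compute=8.9 name=CUDA0 '
          'description="NVIDIA GeForce RTX 4060 Ti"')
print(parse_compute_line(sample)["description"], uses_cuda(sample))
```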

5. Pull & Create (One-Time Setup)

# Pull the base model (~6.6 GB)
ollama pull qwen3.5:9b

# Create the custom model from the Modelfile
ollama create oamazonasgabriel/qwen3.5-9b:q4-16gbGPU \
  -f Qwen/Qwen3.5-9B/Modelfile

6. Free GPU Memory

⚠️ Other GPU-using applications (LM Studio, training scripts, browsers) consume VRAM. Check with nvidia-smi before loading:

nvidia-smi

If VRAM is occupied, quit competing processes:

# pkill does nothing (and prints no usage error) when no process matches
pkill -f lm-studio
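
A pre-flight check can read free VRAM from nvidia-smi's CSV output before loading the model. The sketch below uses the 12 GB free-memory figure from the troubleshooting section as its default threshold; the helper names are hypothetical, and query_gpu() needs an NVIDIA driver to run.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory(csv_text):
    """Parse 'used, total' rows (MiB) into a list of (used, total) ints."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(cell.strip()) for cell in line.split(","))
        rows.append((used, total))
    return rows


def free_gib(used_mib, total_mib):
    """Free memory in GiB for one GPU."""
    return (total_mib - used_mib) / 1024


def enough_free(csv_text, required_gib=12.0, gpu_index=0):
    """True when the chosen GPU has at least required_gib free."""
    used, total = parse_memory(csv_text)[gpu_index]
    return free_gib(used, total) >= required_gib


def query_gpu():
    """Run nvidia-smi and return its CSV output (requires an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
```

On a machine with an NVIDIA GPU, enough_free(query_gpu()) reports whether the first GPU has room for the model.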

7. Test the Model

Via CLI:

ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

Via REST API:

curl -s http://127.0.0.1:11434/api/generate \
  -d '{
    "model": "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU",
    "prompt": "Write a function to merge two sorted arrays in Python",
    "stream": false
  }'

Via OpenAI-compatible endpoint:

curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU",
    "messages": [{"role": "user", "content": "Write a React component"}],
    "stream": false
  }'

USAGE EXAMPLES

CLI

# Interactive chat (coding mode)
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

# Single code prompt
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU \
  "Write a Python function that implements binary search"

# With thinking mode enabled
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU --think

Python

import ollama

# Chat with code generation
response = ollama.chat(
    model='oamazonasgabriel/qwen3.5-9b:q4-16gbGPU',
    messages=[{'role': 'user', 'content': 'Write a FastAPI endpoint'}],
)
print(response.message.content)

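# With thinking mode (the Python client takes a top-level think argument)
response = ollama.chat(
    model='oamazonasgabriel/qwen3.5-9b:q4-16gbGPU',
    messages=[{'role': 'user', 'content': 'Debug this code: ...'}],
    think=True,
)
print(response.message.thinking)  # reasoning trace
print(response.message.content)   # final answer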

OpenCode Integration

Add to ~/.config/opencode/opencode.jsonc:

"oamazonasgabriel/qwen3.5-9b:q4-16gbGPU": {
  "name": "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU",
  "options": {
    "supportsThinking": true,
    "contextWindow": 32768
  }
}

Use as agent:

"model": "ollama/oamazonasgabriel/qwen3.5-9b:q4-16gbGPU"

CONTEXT WINDOW SCALING

With the 16 GB VRAM budget and q4_0 KV cache:

| Context | KV Cache | Fits 16 GB? | Notes |
| --- | --- | --- | --- |
| 8,192 | ~0.7 GB | ✅ Lots of headroom | ~8 GB total |
| 16,384 | ~1.5 GB | ✅ Plenty of headroom | ~9 GB total |
| 32,768 | ~3-4 GB | ✅ Recommended | ~10-11 GB total |
| 65,536 | ~6-8 GB | ⚠️ Tight | ~13-15 GB total; monitor with nvidia-smi |
| 131,072 | ~12-16 GB | ❌ No | CPU offload required |
| 262,144 (native) | ~24+ GB | ❌ No | Needs 32 GB+ VRAM |
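
The rows above follow an approximately linear pattern: doubling the context doubles the KV cache. The sketch below extrapolates from the 32,768-token row under that linearity assumption and estimates the largest context that fits a given budget. It is a back-of-the-envelope model with illustrative names, and actual usage depends on the runtime.

```python
# Assumption: KV cache grows linearly with context, calibrated on the
# 32,768-token row (~3-4 GB at q4_0). Real usage depends on the runtime.
KV_GB_AT_32K = (3.0, 4.0)
WEIGHTS_GB = 6.6
OVERHEAD_GB = 1.0  # upper end of the ~0.5-1 GB process overhead


def kv_cache_gb(context):
    """Return (low, high) KV cache estimate in GB for a context length."""
    scale = context / 32768
    return KV_GB_AT_32K[0] * scale, KV_GB_AT_32K[1] * scale


def max_context(vram_gb, step=1024):
    """Largest context (multiple of step) whose high estimate fits vram_gb."""
    available = vram_gb - WEIGHTS_GB - OVERHEAD_GB
    if available <= 0:
        return 0
    tokens = available / KV_GB_AT_32K[1] * 32768
    return int(tokens // step) * step


for ctx in (8192, 16384, 32768, 65536, 131072):
    low, high = kv_cache_gb(ctx)
    print(f"{ctx:>7}: {low:.1f}-{high:.1f} GB")
print("max context in 15 GB:", max_context(15.0))
```

With 15 GB usable (16 GB minus the desktop baseline), the conservative estimate lands just under 64K tokens, which matches the "tight" rating for the 65,536 row.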

PERFORMANCE

On a 16 GB single-GPU setup:

  • Prompt processing: ~40-60 tok/s (depends on prompt length)
  • Text generation: ~40-80 tok/s (dense model, all 9B active)
  • Model load time: ~10-15 seconds (6.6 GB load)
  • First token latency: low, because the small model loads quickly

The dense architecture means every token uses the full 9B parameters, giving you maximum quality per response — no routing decisions or expert sparsity.
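
To compare your own hardware against these figures, throughput can be computed from the timing fields that Ollama includes in a non-streaming /api/generate or /api/chat response (eval_count, eval_duration, and so on, with durations in nanoseconds). The sample values below are made up purely to show the calculation, and the function names are illustrative.

```python
def tokens_per_second(count, duration_ns):
    """Ollama reports durations in nanoseconds."""
    if duration_ns <= 0:
        return 0.0
    return count / (duration_ns / 1e9)


def summarize(response):
    """Compute prompt and generation speed from a non-streaming response."""
    return {
        "prompt_tok_s": tokens_per_second(
            response.get("prompt_eval_count", 0),
            response.get("prompt_eval_duration", 0)),
        "generation_tok_s": tokens_per_second(
            response.get("eval_count", 0),
            response.get("eval_duration", 0)),
        "load_s": response.get("load_duration", 0) / 1e9,
    }


# Made-up numbers, only to show the shape of the calculation.
sample = {"prompt_eval_count": 26, "prompt_eval_duration": 500_000_000,
          "eval_count": 300, "eval_duration": 5_000_000_000,
          "load_duration": 12_000_000_000}
print(summarize(sample))
```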


MONITORING

# Check model is fully GPU-resident
ollama ps
# The PROCESSOR column should read "100% GPU"; if a CPU share appears, reduce context or switch KV cache type

# Real-time GPU usage
watch -n 1 nvidia-smi

# GPU memory query
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

# Ollama server logs (Linux)
journalctl -u ollama -n 50 --no-pager
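
The GPU-residency check can also be done programmatically through the /api/ps endpoint, which reports each loaded model's total size and the portion held in VRAM (size and size_vram). The sketch below flags models that are partly offloaded to the CPU; the helper names and the 0.999 threshold are arbitrary choices for illustration.

```python
import json
import urllib.request


def gpu_fraction(model_entry):
    """Share of the loaded model held in VRAM (1.0 means fully on GPU)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def offloaded_models(ps_payload, threshold=0.999):
    """Names of loaded models that are partly running on the CPU."""
    return [m["name"] for m in ps_payload.get("models", [])
            if gpu_fraction(m) < threshold]


def fetch_ps(base_url="http://127.0.0.1:11434"):
    """Fetch the list of running models from a local Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)
```

With a model loaded, offloaded_models(fetch_ps()) should return an empty list when everything fits on the GPU.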

TROUBLESHOOTING

Model fails to load — “unable to allocate buffer”

Cause: Not enough free VRAM. Another process (LM Studio, browser, Xorg) is consuming GPU memory.

Fix: Kill competing GPU processes. Free at least 12 GB before loading.

OOM / Model does not load

This is unlikely on 16 GB with Q4_K_M, but if it happens:

# Try with reduced context: start a session, then lower num_ctx at the prompt
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
>>> /set parameter num_ctx 16384
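
When calling the API rather than the CLI, the context window can be lowered per request through the options field. The sketch below builds such a request with the standard library only; build_request and generate are hypothetical helper names, and the network call runs only when generate() is invoked against a live server.

```python
import json
import urllib.request

MODEL = "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU"


def build_request(prompt, num_ctx=16384, model=MODEL):
    """Build an /api/generate body that overrides num_ctx for one request."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def generate(prompt, num_ctx=16384, base_url="http://127.0.0.1:11434"):
    """Send the request to a local Ollama server and return the reply text."""
    body = json.dumps(build_request(prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

If 16384 still fails to load, halving num_ctx again reduces the KV cache further.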

CPU offloading detected

ollama ps shows CPU usage for the model. Reduce num_ctx or ensure OLLAMA_KV_CACHE_TYPE=q4_0 is set.

Thinking mode not working

Thinking mode is off by default in this configuration. Enable it through the API with the think parameter:

ollama.chat(..., think=True)

Or in the CLI: pass --think to ollama run, or type /set think in an interactive session.

Slow generation

On 16 GB hardware, you should get 40-80 tok/s. If significantly slower:

  • Check for CPU offloading (ollama ps)
  • Ensure GPU is detected via CUDA, not Vulkan
  • Check another process isn’t consuming GPU compute

Tool calling fails

Ensure you’re using the correct prompt format. This model uses the Qwen chat template with <|im_start|> and <|im_end|> tokens.
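
When debugging tool calls, it can help to bypass the coding agent and test against /api/chat directly, where tools are passed as JSON schemas in a tools field and the server applies the chat template. The sketch below shows one way to shape the request and dispatch any tool calls in the reply. The add_numbers tool and helper names are hypothetical, and the reply layout (tool_calls entries with function.name and function.arguments as a dict) follows Ollama's native API as I understand it.

```python
def add_numbers(a, b):
    """Example tool; hypothetical, used only for illustration."""
    return a + b


TOOLS = [{
    "type": "function",
    "function": {
        "name": "add_numbers",
        "description": "Add two numbers",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
    },
}]
REGISTRY = {"add_numbers": add_numbers}


def build_chat_request(model, user_text):
    """Body for POST /api/chat with the tool schema attached."""
    return {"model": model, "stream": False, "tools": TOOLS,
            "messages": [{"role": "user", "content": user_text}]}


def run_tool_calls(message):
    """Execute each tool call in an assistant message; return tool messages."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        handler = REGISTRY.get(fn["name"])
        if handler is None:
            continue  # unknown tool name: skip rather than crash
        output = handler(**fn["arguments"])
        results.append({"role": "tool", "tool_name": fn["name"],
                        "content": str(output)})
    return results
```

If the reply to such a request contains no tool_calls for a prompt that clearly needs the tool, the problem is more likely in the model or template than in the agent configuration.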


PUSH TO REGISTRY

ollama push oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

QUICK REFERENCE

# Full setup from scratch
ollama pull qwen3.5:9b
ollama create oamazonasgabriel/qwen3.5-9b:q4-16gbGPU \
  -f Qwen/Qwen3.5-9B/Modelfile
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

# Or just pull from registry (once pushed)
ollama pull oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

ALTERNATIVE MODELS

If you need more capability or have more VRAM:

# Qwen3.6-35B-A3B MoE (needs 24 GB)
ollama pull oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

# Higher quality Qwen3.5-9B quant
ollama pull qwen3.5:9b  # Default Q4_K_M
ollama pull hf.co/bartowski/Qwen3.5-9B-Instruct-GGUF:Q8_0  # Near-lossless, ~9 GB

LINKS

| Resource | URL |
| --- | --- |
| This model on Ollama | https://ollama.com/oamazonasgabriel/qwen3.5-9b |
| Upstream model (Ollama) | https://ollama.com/library/qwen3.5:9b |
| Upstream model (HuggingFace) | https://huggingface.co/Qwen/Qwen3.5-9B |
| Ollama documentation | https://docs.ollama.com |
| Contributor's GitHub | https://github.com/oamazonasgabriel |

Built by impacte.tech