A coding-optimized configuration of Qwen3.5-9B designed for 16 GB single-GPU hardware. The model uses the official Q4_K_M quantization (~6.6 GB weights), leaving ~9 GB headroom for KV cache — enabling 32K+ context windows comfortably.
Model: oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
enable_thinking API parameter

| GPU | Memory | Example Models |
|---|---|---|
| Single GPU | 16 GB VRAM | RTX 4060 Ti 16GB, RTX 5060 16GB, RTX 4080, A4000, etc. |
| Apple Silicon | 16 GB+ unified | M2/M3/M4 Pro or Max |
Requires NVIDIA driver 525+ (550+ recommended).
Qwen3.5-9B is the sweet spot for 16 GB hardware:
| Quantization | Weight Size | VRAM Needed | Context (q4_0) | Fits 16 GB? |
|---|---|---|---|---|
| Q4_K_M (default) | ~6.6 GB | ~8 GB | 32K+ ✅ | ✅ Yes — recommended |
| Q8_0 | ~9-10 GB | ~11 GB | 16K-32K ✅ | ✅ Yes |
| F16 (BF16) | ~18 GB | ~20 GB | — | ❌ No |
The official qwen3.5:9b at Q4_K_M is the ideal choice — small enough to leave
massive headroom, yet capable enough to beat models 3x its size on coding and
reasoning benchmarks.
The custom Modelfile is at Qwen3.5-9B/Modelfile. Key settings:
| Parameter | Value | Rationale |
|---|---|---|
| FROM | qwen3.5:9b | Official 6.6 GB Q4_K_M; fits 16 GB with headroom |
| num_ctx | 32768 | 32K context with lots of headroom on 16 GB |
| num_gpu | 99 | Offload all 32 layers to GPU |
| temperature | 0.7 | Lower for deterministic, reliable code output |
| top_p | 0.8 | Narrow sampling for focused code generation |
| top_k | 20 | Standard for coding tasks |
| min_p | 0.0 | No minimum probability filter |
| presence_penalty | 0.0 | No topic diversity forcing for code |
| repeat_penalty | 1.05 | Slight penalty to avoid repetition in code |
| frequency_penalty | 0.0 | No frequency penalty |
| stop | <\|im_start\|>, <\|im_end\|> | Chat template stop tokens |

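
Each row of the table corresponds to one Modelfile directive. The sketch below renders a Modelfile from the table's values to show the format; it is an illustration, not the author's actual Qwen3.5-9B/Modelfile, and the `render_modelfile` helper is hypothetical.

```python
# Hypothetical helper: renders a Modelfile from the settings in the table above.
# The real Modelfile may differ in ordering or contain extra directives.

BASE_MODEL = "qwen3.5:9b"

PARAMETERS = {
    "num_ctx": 32768,
    "num_gpu": 99,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 0.0,
    "repeat_penalty": 1.05,
    "frequency_penalty": 0.0,
}

STOP_TOKENS = ["<|im_start|>", "<|im_end|>"]


def render_modelfile(base, parameters, stop_tokens):
    """Return Modelfile text: one FROM line, then one PARAMETER line per setting."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    # Each stop token gets its own quoted PARAMETER line.
    for token in stop_tokens:
        lines.append(f'PARAMETER stop "{token}"')
    return "\n".join(lines) + "\n"


print(render_modelfile(BASE_MODEL, PARAMETERS, STOP_TOKENS))
```

Writing the output to a file and passing it to `ollama create ... -f <file>` would build an equivalent model, assuming the base model has already been pulled.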
| Component | Size |
|---|---|
| qwen3.5:9b weights (Q4_K_M) | ~6.6 GB |
| KV cache (q4_0, 32K context) | ~3-4 GB |
| KV cache (q4_0, 64K context) | ~6-8 GB |
| KV cache (f16, 32K context) | ~6-8 GB |
| Ollama process overhead | ~0.5-1 GB |
| Total (q4_0 + 32K) | ~10-11 GB ✅ lots of headroom |
| Total (q4_0 + 64K) | ~13-15 GB ⚠️ tight but fits |
| Total (f16 + 32K) | ~13-15 GB ⚠️ tight but fits |
Note: About 1 GB of baseline VRAM is consumed by the desktop environment (X11/Wayland, browser, etc.). Even accounting for this, q4_0 + 32K fits with ~4-5 GB to spare.
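
The totals in the table are sums of its component rows. A minimal sketch of that arithmetic follows, assuming the table's ranges and the ~1 GB desktop baseline from the note; the function names are illustrative. Because it adds the unrounded ranges, the results come out slightly wider than the table's rounded totals.

```python
# Figures are taken from the table above, in GB, as (low, high) estimates.
WEIGHTS = (6.6, 6.6)
OVERHEAD = (0.5, 1.0)
KV_CACHE = {
    ("q4_0", 32768): (3.0, 4.0),
    ("q4_0", 65536): (6.0, 8.0),
    ("f16", 32768): (6.0, 8.0),
}


def total_vram(kv_type, context):
    """Sum weights, KV cache and process overhead; returns (low, high) in GB."""
    kv = KV_CACHE[(kv_type, context)]
    low = WEIGHTS[0] + kv[0] + OVERHEAD[0]
    high = WEIGHTS[1] + kv[1] + OVERHEAD[1]
    return round(low, 1), round(high, 1)


def headroom(total, vram_gb=16.0, desktop_gb=1.0):
    """VRAM left after the model and the desktop baseline; (worst, best) in GB."""
    low, high = total
    return (round(vram_gb - desktop_gb - high, 1),
            round(vram_gb - desktop_gb - low, 1))


for key in KV_CACHE:
    total = total_vram(*key)
    print(key, "total:", total, "headroom:", headroom(total))
```

A negative worst-case headroom for the 64K row is what "tight but fits" means in practice: it fits only if the actual usage lands toward the low end of the estimates.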
# Ubuntu/Debian
curl -fsSL https://ollama.com/install.sh | sh
# macOS
brew install ollama
# Verify
ollama --version # Should be 0.30.6 or later
# KV Cache at q4_0 — halves memory vs default f16
# Only needed if you want maximum context on 16 GB
export OLLAMA_KV_CACHE_TYPE=q4_0
# Quantized KV cache requires flash attention; enable it explicitly
# if your Ollama build does not already do so by default
export OLLAMA_FLASH_ATTENTION=1
# Single GPU — no special scheduler needed
# Limit to one parallel request (optional)
export OLLAMA_NUM_PARALLEL=1
export OLLAMA_MAX_LOADED_MODELS=1
Permanent setup (systemd):
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Then:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Permanent setup (macOS launchd):
launchctl setenv OLLAMA_KV_CACHE_TYPE q4_0
launchctl setenv OLLAMA_NUM_PARALLEL 1
launchctl setenv OLLAMA_MAX_LOADED_MODELS 1
# Quit and restart Ollama from Applications
# Start (if not running via systemd)
ollama serve
Verify it’s running:
curl -s http://127.0.0.1:11434/api/version
# → {"version":"0.30.6"}
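
The same check can be scripted. The sketch below compares the reported version against the 0.30.6 minimum mentioned earlier, using tuple comparison because plain string comparison would rank "0.9.0" above "0.30.6". The helper names are illustrative, and `fetch_version_json` needs a running server.

```python
import json
import urllib.request

MIN_VERSION = (0, 30, 6)  # minimum stated in this guide


def parse_version(text):
    """Turn '0.30.6' (or '0.30.6-rc1') into a tuple of ints for comparison."""
    core = text.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def is_supported(version_json, minimum=MIN_VERSION):
    """Check the body returned by /api/version against the minimum version."""
    version = json.loads(version_json)["version"]
    return parse_version(version) >= minimum


def fetch_version_json(base_url="http://127.0.0.1:11434"):
    """Fetch the raw JSON body from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/version", timeout=5) as resp:
        return resp.read().decode("utf-8")
```

With the server up, `is_supported(fetch_version_json())` returns a boolean.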
# Check Ollama detects your GPU via CUDA
ollama ps # (run after model is loaded)
# Or check server logs:
journalctl -u ollama --no-pager | grep "inference compute"
Expected output:
inference compute ... library=CUDA compute=8.9 name=CUDA0 description="NVIDIA GeForce RTX 4060 Ti"
If you see “library=Vulkan” instead, ensure:
- NVIDIA driver is properly installed (525+)
- Ollama was restarted after setting env vars
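
To automate the backend check, the log line can be split into its key=value fields. This is a sketch that assumes the line has the shape of the expected output shown above; other Ollama versions may log different fields.

```python
import re

# Matches key=value pairs, where the value is either quoted or a bare token.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_inference_compute(line):
    """Extract key=value fields from an 'inference compute' log line."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


def uses_cuda(line):
    """True when the log line reports the CUDA backend."""
    return parse_inference_compute(line).get("library") == "CUDA"


sample = ('inference compute ... library=CUDA compute=8.9 name=CUDA0 '
          'description="NVIDIA GeForce RTX 4060 Ti"')
print(parse_inference_compute(sample))
```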
# Pull the base model (~6.6 GB)
ollama pull qwen3.5:9b
# Create the custom model from the Modelfile
ollama create oamazonasgabriel/qwen3.5-9b:q4-16gbGPU \
-f Qwen/Qwen3.5-9B/Modelfile
⚠️ Other GPU-using applications (LM Studio, training scripts, browsers) consume VRAM. Check with nvidia-smi before loading:
nvidia-smi
If VRAM is occupied, quit competing processes:
kill $(pgrep -f lm-studio) 2>/dev/null
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
curl -s http://127.0.0.1:11434/api/generate \
-d '{
"model": "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU",
"prompt": "Write a function to merge two sorted arrays in Python",
"stream": false
}'
curl -s http://127.0.0.1:11434/v1/chat/completions \
-d '{
"model": "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU",
"messages": [{"role": "user", "content": "Write a React component"}],
"stream": false
}'
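
The first curl call above can be reproduced with the Python standard library alone. This is a minimal sketch: `build_generate_request` and `generate` are illustrative names, and the optional `num_ctx` override is an addition not present in the curl example.

```python
import json
import urllib.request

MODEL = "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU"


def build_generate_request(prompt, base_url="http://127.0.0.1:11434", num_ctx=None):
    """Build a non-streaming POST request for /api/generate."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        # Per-request override of the context length set in the Modelfile.
        payload["options"] = {"num_ctx": num_ctx}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

Keeping the request builder separate from the network call makes the payload easy to inspect or log before anything is sent.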
# Interactive chat (coding mode)
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
# Single code prompt
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU \
"Write a Python function that implements binary search"
# With thinking mode enabled
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU --think
import ollama

# Chat with code generation
response = ollama.chat(
    model='oamazonasgabriel/qwen3.5-9b:q4-16gbGPU',
    messages=[{'role': 'user', 'content': 'Write a fast API endpoint'}],
)
print(response.message.content)

# With thinking mode: think is a top-level argument, not an entry in options
response = ollama.chat(
    model='oamazonasgabriel/qwen3.5-9b:q4-16gbGPU',
    messages=[{'role': 'user', 'content': 'Debug this code: ...'}],
    think=True,
)
print(response.message.content)
Add to ~/.config/opencode/opencode.jsonc:
"oamazonasgabriel/qwen3.5-9b:q4-16gbGPU": {
  "name": "oamazonasgabriel/qwen3.5-9b:q4-16gbGPU",
  "options": {
    "supportsThinking": true,
    "contextWindow": 32768
  }
}
Use as agent:
"model": "ollama/oamazonasgabriel/qwen3.5-9b:q4-16gbGPU"
With the 16 GB VRAM budget and q4_0 KV cache:
| Context | KV Cache | Fits 16 GB? | Notes |
|---|---|---|---|
| 8,192 | ~0.7 GB | ✅ Lots of headroom | ~8 GB total |
| 16,384 | ~1.5 GB | ✅ Plenty of headroom | ~9 GB total |
| 32,768 | ~3-4 GB | ✅ Recommended | ~10-11 GB total |
| 65,536 | ~6-8 GB | ⚠️ Tight | ~13-15 GB total — monitor with nvidia-smi |
| 131,072 | ~12-16 GB | ❌ No | CPU offload required |
| 262,144 (native) | ~24+ GB | ❌ No | Needs 32 GB+ VRAM |
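
The table can be turned into a simple lookup for the largest context that fits a given amount of free VRAM. This sketch uses the upper end of each KV cache estimate plus 1 GB of process overhead, so it errs on the conservative side; `largest_context` is an illustrative helper, and the 262,144 row uses the table's "24+" figure as a lower bound.

```python
WEIGHTS_GB = 6.6
OVERHEAD_GB = 1.0  # upper end of the process overhead estimate

# Upper-bound q4_0 KV cache sizes from the table above, in GB.
KV_CACHE_GB = {
    8192: 0.7,
    16384: 1.5,
    32768: 4.0,
    65536: 8.0,
    131072: 16.0,
    262144: 24.0,  # table says "24+", so this is a lower bound
}


def largest_context(free_vram_gb):
    """Return the largest tabulated context whose worst-case total fits, or None."""
    fitting = [
        ctx for ctx, kv in KV_CACHE_GB.items()
        if WEIGHTS_GB + OVERHEAD_GB + kv <= free_vram_gb
    ]
    return max(fitting) if fitting else None


for free in (9.0, 15.0, 16.0):
    print(free, "GB free ->", largest_context(free))
```

With 15 GB free (16 GB minus a 1 GB desktop baseline), the worst-case estimate admits 32,768 but not 65,536, which matches the table's "Recommended" and "Tight" labels.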
On a 16 GB single-GPU setup:
- Prompt processing: ~40-60 tok/s (depends on prompt length)
- Text generation: ~40-80 tok/s (dense model, all 9B active)
- Model load time: ~10-15 seconds (6.6 GB load)
- First token latency: Low (model loads fast due to small size)
The dense architecture means every token uses the full 9B parameters, giving you maximum quality per response — no routing decisions or expert sparsity.
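
As a rough worked example, the quoted throughput ranges translate into response times as follows. This sketch assumes the model is already loaded (the load time is not included) and treats the tok/s ranges above as given.

```python
def response_time_range(prompt_tokens, output_tokens,
                        prompt_tps=(40, 60), generation_tps=(40, 80)):
    """Return (best, worst) seconds, using the tok/s ranges quoted above."""
    best = prompt_tokens / prompt_tps[1] + output_tokens / generation_tps[1]
    worst = prompt_tokens / prompt_tps[0] + output_tokens / generation_tps[0]
    return best, worst


# A 1,200-token prompt with a 400-token answer.
best, worst = response_time_range(1200, 400)
print(f"{best:.0f}-{worst:.0f} seconds")
```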
# Check model is fully GPU-resident
ollama ps
# If CPU% appears, reduce context or switch KV cache type
# Real-time GPU usage
watch -n 1 nvidia-smi
# GPU memory query
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
# Ollama server logs (Linux)
journalctl -u ollama -n 50 --no-pager
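
The memory query above can also be checked from a script before loading the model. This is a sketch: it adds the `noheader,nounits` format options so the output is plain numbers in MiB, and the 12 GB default is the free-VRAM figure this guide suggests in its troubleshooting notes. The function names are illustrative.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (MiB) into a list of dicts."""
    gpus = []
    for row in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in row.split(","))
        gpus.append({"index": index, "used_mib": used, "free_mib": total - used})
    return gpus


def has_free_vram(csv_text, required_gib=12.0):
    """True if any GPU has at least required_gib free."""
    return any(g["free_mib"] / 1024 >= required_gib
               for g in parse_gpu_memory(csv_text))


def query_gpus():
    """Run nvidia-smi and return its CSV output (needs an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
```

On a machine with an NVIDIA GPU, `has_free_vram(query_gpus())` gives a quick go/no-go answer.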
Cause: Not enough free VRAM. Another process (LM Studio, browser, Xorg) is consuming GPU memory.
Fix: Kill competing GPU processes. Free at least 12 GB before loading.
This is unlikely on 16 GB with Q4_K_M, but if it happens:
# Try with reduced context: start a session, then set num_ctx at the prompt
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
>>> /set parameter num_ctx 16384
If ollama ps shows CPU usage for the model, reduce num_ctx or ensure
OLLAMA_KV_CACHE_TYPE=q4_0 is set.
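
The same information is available from the server's /api/ps endpoint, which reports each loaded model's total size and the portion held in VRAM. A minimal sketch, with illustrative helper names; `fetch_ps` needs a running server.

```python
import json
import urllib.request


def gpu_fraction(model_entry):
    """Share of the loaded model held in VRAM, from 'size' and 'size_vram' (bytes)."""
    size = model_entry["size"]
    return model_entry["size_vram"] / size if size else 0.0


def offloaded_models(ps_response):
    """Names of loaded models that are not fully GPU-resident."""
    return [m["name"] for m in ps_response.get("models", [])
            if gpu_fraction(m) < 1.0]


def fetch_ps(base_url="http://127.0.0.1:11434"):
    """Query a running server for its loaded models."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.loads(resp.read())
```

An empty list from `offloaded_models(fetch_ps())` means every loaded model sits entirely in VRAM.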
Thinking mode is off by default in this configuration. Enable it via API:
ollama.chat(..., think=True)
Or in the CLI, pass --think to ollama run.
On 16 GB hardware, you should get 40-80 tok/s. If significantly slower:
- Check for CPU offloading (ollama ps)
- Ensure GPU is detected via CUDA, not Vulkan
- Check another process isn’t consuming GPU compute
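
To compare your own numbers against the 40-80 tok/s figure, the timing fields in a non-streaming /api/generate response can be converted to tokens per second. The sketch assumes the response has already been parsed into a dict; durations are reported in nanoseconds.

```python
def tokens_per_second(count, duration_ns):
    """Convert a token count and a nanosecond duration into tokens per second."""
    return count / (duration_ns / 1e9) if duration_ns else 0.0


def throughput(response):
    """Prompt and generation speed from a non-streaming /api/generate response."""
    return {
        "prompt_tps": tokens_per_second(
            response.get("prompt_eval_count", 0),
            response.get("prompt_eval_duration", 0)),
        "generation_tps": tokens_per_second(
            response.get("eval_count", 0),
            response.get("eval_duration", 0)),
    }


# Illustrative values, not a measurement.
sample = {
    "prompt_eval_count": 100, "prompt_eval_duration": 2_000_000_000,
    "eval_count": 300, "eval_duration": 5_000_000_000,
}
print(throughput(sample))
```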
Ensure you’re using the correct prompt format. This model uses the Qwen chat
template with <|im_start|> and <|im_end|> tokens.
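
For reference, the layout those tokens produce looks roughly like the output of the sketch below. It assumes the plain ChatML arrangement commonly used by Qwen models; the template actually shipped with this model may add thinking or tool sections, and Ollama applies it automatically when you use the chat endpoints. `render_chatml` is an illustrative helper.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in ChatML form using <|im_start|> and <|im_end|> markers."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "Write a binary search in Python."},
]))
```

Manual rendering like this only matters when sending raw prompts; with the chat API, pass the messages list and let the server format it.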
ollama push oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
# Full setup from scratch
ollama pull qwen3.5:9b
ollama create oamazonasgabriel/qwen3.5-9b:q4-16gbGPU \
-f Qwen/Qwen3.5-9B/Modelfile
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
# Or just pull from registry (once pushed)
ollama pull oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
ollama run oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
If you need more capability or have more VRAM:
# Qwen3.6-35B-A3B MoE (needs 24 GB)
ollama pull oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
# Higher quality Qwen3.5-9B quant
ollama pull qwen3.5:9b # Default Q4_K_M
ollama pull hf.co/bartowski/Qwen3.5-9B-Instruct-GGUF:Q8_0 # Near-lossless, ~9 GB
| Resource | URL |
|---|---|
| This model on Ollama | https://ollama.com/oamazonasgabriel/qwen3.5-9b |
| Upstream model (Ollama) | https://ollama.com/library/qwen3.5:9b |
| Upstream model (HuggingFace) | https://huggingface.co/Qwen/Qwen3.5-9B |
| Ollama documentation | https://docs.ollama.com |
| Contributor’s GitHub | https://github.com/oamazonasgabriel |
| Built by | impacte.tech |