A memory-efficient model configuration of Qwen3.6-35B-A3B using an upstream imatrix-calibrated IQ4_XS quantization and q4_0 KV cache. Designed for GPUs with 24 GB of VRAM.

Applications

Claude Code: ollama launch claude --model oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

OpenCode: ollama launch opencode --model oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

Hermes Agent: ollama launch hermes --model oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

OpenClaw: ollama launch openclaw --model oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

Qwen3.6-35B-A3B — Ollama Model (24 GB)

Model: oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

DESCRIPTION

A memory-efficient model configuration of Qwen3.6-35B-A3B using an upstream imatrix-calibrated IQ4_XS quantization and a q4_0 KV cache. Designed to run on hardware with 24 GB of GPU-accessible memory (a single GPU with 24 GB VRAM, or Apple Silicon with 24 GB unified memory), leaving headroom for a 16K context window.

Key Features

Mixture-of-Experts (MoE): 34.7B total parameters, only ~3B active per token
Fast Generation: ~20-45 tok/s on compatible GPU hardware
Tool Calling: Native support via qwen3_coder parser
Thinking Mode: Enabled by default
Long Context: 262K native (16K configured for VRAM budget)

Architecture

Property	Value
Architecture	Mixture-of-Experts (MoE)
Total Parameters	34.7B
Active Parameters	~3B (per token)
Experts	256 (8 routed + 1 shared active per token)
Layers	40 (hybrid Gated DeltaNet + Gated Attention + MoE)
Native Context	262,144 tokens
Extended Context	up to 1,010,000 via YaRN
Modalities	Text (this variant); original upstream supports Image + Video
Quantization	IQ4_XS imatrix-calibrated
Model Size	~18 GB (weights)
License	Apache 2.0
Original Upstream	Qwen/Qwen3.6-35B-A3B
Quantization Upstream	BatiAI/Qwen3.6-35B-A3B

REQUIREMENTS

Resource	Minimum	Recommended
GPU Memory	24 GB VRAM	24 GB+ VRAM
System RAM	32 GB	48 GB
Disk Space	20 GB free	50 GB+ free
NVIDIA Driver	525+	550+
Ollama Version	0.30.6+	Latest

Platform support:

- NVIDIA GPU: Single GPU with 24 GB+ VRAM (e.g., RTX 3090, RTX 4090, RTX A5000)
- Apple Silicon: Mac with 24 GB+ unified memory (M2 Max, M3 Max, M4 Max, etc.)
- AMD GPU: ROCm-compatible with 24 GB+ VRAM

💡 Why 24 GB? The model weights are ~18 GB at IQ4_XS. With q4_0 KV cache and 16K context, total VRAM usage is ~20-21 GB. The remaining headroom accounts for Ollama overhead and system processes.

BENCHMARKS

Benchmark Comparison: Qwen3.5-9B (Q4_K_M) vs Qwen3.6-35B (IQ4_XS)

Results from local Ollama benchmarking on a dual-GPU setup (16 GB + 8 GB).

1. HumanEval-Instruct (Python Code Generation)

Metric	Qwen3.5-9B (Q4_K_M)	Qwen3.6-35B (IQ4_XS)	Winner
pass@1 (164 samples)	0.866 (±0.027)	0.689 (±0.036)	9B
pass@1 (20 samples)	0.950 (±0.050)	0.950 (±0.050)	tie

Bottom line: On the full 164-problem set, the 9B dense model at Q4_K_M writes markedly more correct Python code (86.6% vs 68.9% pass@1); the two models tie at 95.0% on the 20-problem subset. The result suggests that the 35B MoE loses code-generation quality under the more aggressive IQ4_XS quantization.
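The ± values in the table are consistent with the standard error of a mean over pass/fail outcomes, computed with the sample (n - 1) variance. That reading is an assumption, since the benchmark tool's exact method is not stated here, and the pass counts below (142, 113, 19) are back-calculated from the reported rates rather than taken from the raw results. A minimal sketch of the arithmetic:

```python
import math

def pass_at_1_stderr(passed: int, total: int) -> float:
    """Standard error of the pass@1 mean over `total` pass/fail outcomes (n - 1 variance)."""
    p = passed / total
    return math.sqrt(p * (1 - p) / (total - 1))

# Pass counts inferred from the reported rates (assumption, not from the raw results).
runs = {
    "Qwen3.5-9B, 164 problems": (142, 164),
    "Qwen3.6-35B, 164 problems": (113, 164),
    "either model, 20 problems": (19, 20),
}

for label, (passed, total) in runs.items():
    rate = passed / total
    print(f"{label}: pass@1 = {rate:.3f} ± {pass_at_1_stderr(passed, total):.3f}")
```

On 20 problems the standard error is about 0.05, so the tie on the subset carries little information, while the gap of roughly 0.18 on the full set is several times either standard error.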

2. MongoDB 8.0 Developer Questions (100 questions, all 14 categories)

Text-only benchmark with LLM-as-judge (independent model: gemma-4-31b-it via OpenRouter). Scores are 1–5.

Dimension	Qwen3.5-9B (Q4_K_M)	Qwen3.6-35B (IQ4_XS)	Winner
Judge Overall	3.67	4.53	35B
Judge Factual	3.10	4.33	35B
Judge Code Quality	3.03	4.38	35B
Judge Completeness	4.02	4.56	35B
Judge Clarity	4.54	4.86	35B

Performance metrics (100 questions):

Metric	Qwen3.5-9B (Q4_K_M)	Qwen3.6-35B (IQ4_XS)
Avg Response Time	20.7s	20.4s
Avg Output Tokens	1,247	1,855
Avg TTFT	7.3s	13.5s

Bottom line: When tested across all 14 MongoDB categories (not just CRUD), the 35B wins every judge dimension, with the largest gaps in code quality (+1.35) and factual accuracy (+1.23) and the smallest in clarity (+0.32). The 9B’s strong CRUD-only performance didn’t generalize to harder topics like Sharding, Change Streams, and MongoDB 8.0 features.

3. MongoDB 8.0 — Live Code Execution (10 CRUD questions)

With live MongoDB 8.0 Docker container executing generated code.

Metric	Qwen3.5-9B (Q4_K_M)	Qwen3.6-35B (IQ4_XS)	Winner
Code Exec Success Rate	40.0%	57.1%	35B
Avg Response Time	20.6s	5.9s	35B
Avg Output Tokens	1,258	416	35B
Judge Overall	4.78	4.65	9B

Note: On CRUD-only questions, the 9B scored slightly higher with the judge (4.78 vs 4.65), but the 35B responded faster and its generated code executed successfully more often.

QUICK START

1. Install Ollama

# macOS (Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

2. Set Environment Variables (Recommended)

Set these before starting Ollama to minimize memory usage:

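# KV cache at q4_0: roughly a quarter of the memory of the default f16
# (quantized KV cache needs Flash Attention; set OLLAMA_FLASH_ATTENTION=1
# if your Ollama version does not enable it by default)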
export OLLAMA_KV_CACHE_TYPE=q4_0

# Limit to one request at a time (memory constraint)
export OLLAMA_NUM_PARALLEL=1
export OLLAMA_MAX_LOADED_MODELS=1

Permanent setup (Linux systemd):

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"

Then:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Permanent setup (macOS launchd):

launchctl setenv OLLAMA_KV_CACHE_TYPE q4_0
launchctl setenv OLLAMA_NUM_PARALLEL 1
launchctl setenv OLLAMA_MAX_LOADED_MODELS 1
# Quit and restart Ollama from Applications
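For scripted or containerized setups, the same variables can be handed to a server process started from Python instead of being set system-wide. The sketch below is one way to do that, assuming the `ollama` binary is on `PATH`; the helper names `build_server_env` and `start_server` are illustrative, not part of Ollama.

```python
import os
import subprocess

# Memory-related settings from this section.
OLLAMA_ENV = {
    "OLLAMA_KV_CACHE_TYPE": "q4_0",
    "OLLAMA_NUM_PARALLEL": "1",
    "OLLAMA_MAX_LOADED_MODELS": "1",
}

def build_server_env(base=None):
    """Return a copy of the environment with the memory settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(OLLAMA_ENV)
    return env

def start_server():
    """Start `ollama serve` as a child process (assumes `ollama` is on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=build_server_env())
```

Calling `start_server()` returns the `Popen` handle, so the caller can stop the server with `terminate()` when finished.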

3. Pull & Run

# Pull the model (downloads ~18 GB)
ollama pull oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

# Run interactively
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

# Single prompt
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU "Explain quantum computing in simple terms"

USAGE

CLI

# Interactive chat
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

# With system prompt (set it inside the interactive session)
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
>>> /set system "You are a helpful coding assistant"

REST API

# Chat completion
curl -s http://127.0.0.1:11434/api/chat \
  -d '{
    "model": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Generate (completion)
curl -s http://127.0.0.1:11434/api/generate \
  -d '{
    "model": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
    "prompt": "Why is the sky blue?",
    "stream": false,
    "options": { "num_ctx": 1024 }
  }'

# OpenAI-compatible endpoint
curl -s http://127.0.0.1:11434/v1/chat/completions \
  -d '{
    "model": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
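The `/api/chat` call above does not set `"stream": false`, so the server replies with newline-delimited JSON, one chunk per line, ending with a chunk whose `done` field is true. The following sketch shows one way to reassemble the text from such a stream; the function name `collect_chat_stream` and the sample chunks are illustrative and were not captured from a real run.

```python
import json

def collect_chat_stream(lines):
    """Join the `message.content` pieces from an /api/chat NDJSON stream."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # final chunk reached
    return "".join(parts)

# Illustrative chunks (made up for this example).
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_chat_stream(sample))  # prints: Hello!
```

With a live server, the same function can consume the response object from `urllib.request.urlopen` directly, since iterating an HTTP response yields one line of bytes at a time.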

Python (ollama library)

pip install ollama

import ollama

# Chat
response = ollama.chat(
    model='oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)

# Generate
response = ollama.generate(
    model='oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU',
    prompt='Write a Python function to sort a list',
)
print(response.response)
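Tool calling, listed under Key Features, follows the usual pattern: describe each tool with a JSON schema, pass the schemas to the chat call, then execute whatever calls the model returns. The sketch below covers the local half of that loop. The `get_weather` tool, the `REGISTRY` mapping, and `dispatch_tool_call` are hypothetical names chosen for illustration.

```python
# Hypothetical local tool; name and behavior are illustrative only.
def get_weather(city: str) -> str:
    return f"Weather lookup for {city} is not implemented in this sketch."

# JSON-schema description of the tool, in the shape chat APIs expect.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Maps tool name -> (callable, schema).
REGISTRY = {"get_weather": (get_weather, WEATHER_TOOL)}

def dispatch_tool_call(name, arguments):
    """Run one tool call, rejecting unknown tools and malformed arguments."""
    if name not in REGISTRY:
        return f"error: unknown tool {name!r}"
    func, schema = REGISTRY[name]
    required = schema["function"]["parameters"].get("required", [])
    missing = [key for key in required if key not in arguments]
    if missing:
        return f"error: missing arguments {missing}"
    try:
        return func(**arguments)
    except TypeError as exc:  # e.g. unexpected extra arguments
        return f"error: {exc}"
```

In recent versions of the `ollama` package, the schemas go in via `ollama.chat(..., tools=[WEATHER_TOOL])`, and each entry in `response.message.tool_calls` exposes `function.name` and `function.arguments`, which can be passed straight to `dispatch_tool_call`. Returning an error string instead of raising lets the error be sent back to the model as the tool result.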

JavaScript (ollama.js)

npm install ollama

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU',
  messages: [{ role: 'user', content: 'Hello!' }],
})
console.log(response.message.content)

MODEL DETAILS

Sampling Parameters

These are baked into the model via its Modelfile:

Parameter	Value	Rationale
`num_ctx`	16384	Safe ceiling with q4_0 KV cache on 24 GB
`num_gpu`	99	Offload all layers to GPU
`temperature`	1.0	Qwen team recommended for thinking mode
`top_p`	0.95	Standard nucleus sampling
`top_k`	20	Focused token selection
`min_p`	0.0	No minimum probability filter
`presence_penalty`	1.5	Encourages topic diversity
`repeat_penalty`	1.0	No repetition penalty
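These defaults can be overridden per request through the `options` field of the API; anything not listed in `options` falls back to the Modelfile value. The sketch below builds such a request body. The helper `build_chat_payload` is a hypothetical name, and the validation against the table is a convenience added for this example.

```python
import json

MODEL = "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU"

# Defaults copied from the table above.
MODELFILE_DEFAULTS = {
    "num_ctx": 16384,
    "num_gpu": 99,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repeat_penalty": 1.0,
}

def build_chat_payload(prompt, **overrides):
    """Build an /api/chat body that overrides only the given options."""
    unknown = set(overrides) - set(MODELFILE_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": overrides,
    }

payload = build_chat_payload("Summarize this log", temperature=0.7, num_ctx=8192)
print(json.dumps(payload, indent=2))
```

Lowering `num_ctx` per request also shrinks the KV cache for that load, which helps when other processes are competing for VRAM.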

MEMORY & PERFORMANCE

VRAM Budget

Component	Size
Model weights (IQ4_XS)	~18 GB
KV cache (q4_0, 16K context)	~1.5-2 GB
Ollama process overhead	~0.5-1 GB
Total (q4_0 + 16K)	~20-21 GB ✅
Total (f16 + 16K)	~25-27 GB ❌

⚠️ About 1 GB of baseline VRAM is consumed by the desktop environment (X11/Wayland, browser, etc.). Account for this when checking headroom.

Context Window Scaling

Context	KV Cache	Fits 24 GB?	Notes
8,192	~0.8-1 GB	✅ Plenty of headroom	~19.5 GB total
16,384	~1.5-2 GB	✅ Recommended	~21 GB total
32,768	~3-4 GB	⚠️ Tight	~23 GB total
65,536	~6-8 GB	❌ No	Spills to CPU
131,072	~12-16 GB	❌ No	CPU offload required
262,144 (native)	~24+ GB	❌ No	Needs 48 GB+ VRAM
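The table follows from a simple rule: the KV cache grows roughly linearly with context length, while the weights and process overhead stay fixed. The sketch below reproduces that estimate from the 16K figures in this section. It is a rough model only; the linear scaling and the helper names are assumptions of this example, and real usage will vary with the Ollama version and driver.

```python
WEIGHTS_GB = 18.0            # IQ4_XS weights
OVERHEAD_GB = (0.5, 1.0)     # Ollama process overhead (low, high)
KV_GB_AT_16K = (1.5, 2.0)    # q4_0 KV cache at 16,384 tokens (low, high)
BASE_CTX = 16384

def estimate_total_gb(num_ctx):
    """Return a (low, high) VRAM estimate in GB for a given context length."""
    scale = num_ctx / BASE_CTX
    low = WEIGHTS_GB + KV_GB_AT_16K[0] * scale + OVERHEAD_GB[0]
    high = WEIGHTS_GB + KV_GB_AT_16K[1] * scale + OVERHEAD_GB[1]
    return low, high

def largest_context_that_fits(budget_gb, candidates):
    """Pick the largest candidate whose high estimate stays within budget."""
    fitting = [c for c in candidates if estimate_total_gb(c)[1] <= budget_gb]
    return max(fitting) if fitting else None

for ctx in (8192, 16384, 32768, 65536):
    low, high = estimate_total_gb(ctx)
    print(f"{ctx:>6} tokens: {low:.1f}-{high:.1f} GB")
```

Passing a budget of 23 GB (a 24 GB card minus about 1 GB of desktop usage) to `largest_context_that_fits` selects 32,768 with no margin left, which matches the "Tight" rating in the table.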

Performance

Prompt processing: ~15-25 tok/s (depends on prompt length)
Text generation: ~20-45 tok/s (only ~3B active params per token)
Model load time: ~50 seconds (~18 GB load)
First token latency: Higher on cold start, faster once cached

The MoE architecture means only 8 routed experts (plus 1 shared expert) out of 256 are activated per token, making generation surprisingly fast for a 35B-parameter model.
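To put those figures in proportion, the arithmetic below uses only the numbers from the Architecture table. The active-parameter count is given as "~3B", so the first result is approximate.

```python
TOTAL_PARAMS_B = 34.7
ACTIVE_PARAMS_B = 3.0     # "~3B" in the text, so the result is approximate
TOTAL_EXPERTS = 256
ROUTED_PER_TOKEN = 8

param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
routed_fraction = ROUTED_PER_TOKEN / TOTAL_EXPERTS

print(f"active parameters per token: {param_fraction:.1%}")   # 8.6%
print(f"routed experts used per token: {routed_fraction:.1%}")  # 3.1%
```

Per-token compute therefore resembles that of a roughly 3B model, while all 34.7B parameters still have to sit in memory, which is why the VRAM budget is driven by the full weight size.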

MONITORING

# Check model is fully GPU-resident
ollama ps
# If the PROCESSOR column shows a CPU share, reduce context or switch KV cache type

# Real-time GPU usage (NVIDIA)
watch -n 1 nvidia-smi

# GPU memory query
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

# Ollama server logs (Linux)
journalctl -u ollama -n 50 --no-pager

# Ollama server logs (macOS)
cat ~/.ollama/logs/server.log
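The check done by eye with `ollama ps` can also be scripted against the server's `/api/ps` endpoint, which reports each loaded model's total `size` and the part held in VRAM as `size_vram`. The sketch below is a minimal version; the helper names and the example entry are illustrative, and the sizes shown are made up rather than measured.

```python
import json
import urllib.request

def gpu_resident_fraction(model_entry):
    """Fraction of a loaded model held in VRAM, from one /api/ps entry."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size

def fetch_running_models(base_url="http://127.0.0.1:11434"):
    """Query a running Ollama server for its loaded models."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp).get("models", [])

# Illustrative entry (sizes are made up, not measured).
example = {"name": "example:latest", "size": 20_000_000_000, "size_vram": 15_000_000_000}
print(f"{gpu_resident_fraction(example):.0%} of the model is in VRAM")  # 75%
```

A fraction below 1.0 for any entry returned by `fetch_running_models()` means part of the model has spilled to CPU memory.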

TROUBLESHOOTING

Model fails to load — “unable to allocate buffer”

Cause: Not enough free VRAM. Another process (LM Studio, another model, a browser, Xorg) is consuming GPU memory.

Fix: Kill competing GPU processes and free at least 22 GB before loading.

OOM / Model does not load

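This tag is already the 24 GB configuration, so pulling it again does not reduce memory use. First lower num_ctx and confirm OLLAMA_KV_CACHE_TYPE=q4_0 is set. If the model still does not load, try a lighter upstream quant:

ollama pull batiai/qwen3.6-35b:q3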

CPU offloading detected

If ollama ps shows a CPU share for the model, reduce num_ctx or make sure OLLAMA_KV_CACHE_TYPE=q4_0 is set.

Slow generation

Generation should not be slow on this model: with only ~3B parameters active per token, typical throughput is ~20-45 tok/s. If it is significantly slower, check for CPU offloading.

Tool calling fails

Lower quantizations (q3) may produce malformed JSON. This model uses IQ4_XS, which handles tool calls reliably.

Vision not supported

This variant is a text-only GGUF quantization. For vision support, use the official qwen3.6:35b-a3b model (requires 48 GB+ VRAM or CPU offloading).

OpenCode Integration

Add an entry under the models key of the ollama provider in ~/.config/opencode/opencode.jsonc:

"oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU": {
  "name": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
  "options": {
    "supportsThinking": false,
    "contextWindow": 16384
  }
}

Use as agent:

"model": "ollama/oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU"

ALTERNATIVE MODELS

If you need more VRAM headroom for longer context:

# APEX I-Compact (17 GB) — better quality than Q3_K_M
ollama pull fredrezones55/Qwen3.6-35B-A3B-APEX:I-Compact

# Hugging Face GGUF import for IQ4_XS (17.7 GB)
ollama pull hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS

LINKS AND ACKNOWLEDGMENTS

Resource	URL
This model on Ollama	https://ollama.com/oamazonasgabriel/qwen3.6-35b-a3b
Upstream Original model (HuggingFace)	https://huggingface.co/Qwen/Qwen3.6-35B-A3B
Upstream BatiAI quantized version	https://ollama.com/batiai/qwen3.6-35b
Ollama documentation	https://docs.ollama.com
Contributor’s GitHub	https://github.com/oamazonasgabriel
Built with love by	impacte.tech