Model: oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
A memory-efficient variant of Qwen3.6-35B-A3B using imatrix-calibrated Q4_K_M quantization. Designed to run on hardware with 24 GB of GPU-accessible memory (a single GPU with 24 GB VRAM or Apple Silicon with 24 GB unified memory), with headroom for a 16K context window.
| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 34.7B |
| Active Parameters | ~3B (per token) |
| Experts | 256 (8 routed + 1 shared) |
| Layers | 40 (hybrid Gated DeltaNet + Gated Attention + MoE) |
| Native Context | 262,144 tokens |
| Extended Context | up to 1,010,000 via YaRN |
| Modalities | Text (this variant); upstream supports Image + Video |
| Quantization | Q4_K_M / IQ4_XS imatrix-calibrated |
| Model Size | ~18 GB (weights) |
| License | Apache 2.0 |
| Upstream | Qwen/Qwen3.6-35B-A3B |
| Resource | Minimum | Recommended |
|---|---|---|
| GPU Memory | 24 GB VRAM | 24 GB+ VRAM |
| System RAM | 32 GB | 48 GB |
| Disk Space | 20 GB free | 50 GB+ free |
| NVIDIA Driver | 525+ | 550+ |
| Ollama Version | 0.30.6+ | Latest |
Platform support:
- NVIDIA GPU: Single GPU with 24 GB+ VRAM (e.g., RTX 4090, A5000, etc.)
- Apple Silicon: Mac with 24 GB+ unified memory (M2 Max, M3 Max, M4 Max, etc.)
- AMD GPU: ROCm-compatible with 24 GB+ VRAM
💡 Why 24 GB? The model weights are ~18 GB at Q4_K_M. With a q4_0 KV cache and 16K context, total VRAM usage is ~20-21 GB, including Ollama's process overhead. The remaining ~3-4 GB of headroom covers the desktop environment and other GPU processes.
# macOS (Homebrew)
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
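Set these before starting Ollama to minimize memory usage. KV cache quantization only takes effect when Flash Attention is enabled (set OLLAMA_FLASH_ATTENTION=1 if your Ollama version does not enable it by default):
# KV cache at q4_0: roughly a quarter of the memory of the default f16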
export OLLAMA_KV_CACHE_TYPE=q4_0
# Limit to one request at a time (memory constraint)
export OLLAMA_NUM_PARALLEL=1
export OLLAMA_MAX_LOADED_MODELS=1
Permanent setup (Linux systemd):
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Then:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Setup for the macOS app (launchctl; values set this way are lost on reboot, so re-run after restarting the Mac):
launchctl setenv OLLAMA_KV_CACHE_TYPE q4_0
launchctl setenv OLLAMA_NUM_PARALLEL 1
launchctl setenv OLLAMA_MAX_LOADED_MODELS 1
# Quit and restart Ollama from Applications
# Pull the model (downloads ~18 GB)
ollama pull oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
# Run interactively
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
# Single prompt
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU "Explain quantum computing in simple terms"
# Interactive chat
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
# With system prompt (set inside the interactive session)
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
>>> /set system You are a helpful coding assistant
# Chat completion
curl -s http://127.0.0.1:11434/api/chat \
-d '{
"model": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Generate (completion)
curl -s http://127.0.0.1:11434/api/generate \
-d '{
"model": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
"prompt": "Why is the sky blue?",
"stream": false,
"options": { "num_ctx": 1024 }
}'
# OpenAI-compatible endpoint
curl -s http://127.0.0.1:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'
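The chat completion call above does not set "stream": false, so /api/chat answers with newline-delimited JSON chunks instead of a single object. The sketch below shows one way to reassemble such a stream. It assumes each chunk carries a message.content fragment and that the last chunk sets done to true; the helper name collect_stream and the sample chunks are made up for illustration.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str]) -> str:
    """Join the content fragments of a streamed /api/chat response."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # final chunk reached; stop reading
    return "".join(parts)


# Illustrative chunks shaped like streamed output (content is invented)
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample))
```

In practice the lines would come from the HTTP response body, read line by line, rather than from a list.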
pip install ollama
import ollama
# Chat
response = ollama.chat(
model='oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU',
messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)
# Generate
response = ollama.generate(
model='oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU',
prompt='Write a Python function to sort a list',
)
print(response.response)
npm install ollama
import ollama from 'ollama'
const response = await ollama.chat({
model: 'oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU',
messages: [{ role: 'user', content: 'Hello!' }],
})
console.log(response.message.content)
These are baked into the model via its Modelfile:
| Parameter | Value | Rationale |
|---|---|---|
| num_ctx | 16384 | Safe ceiling with q4_0 KV cache on 24 GB |
| num_gpu | 99 | Offload all layers to GPU |
| temperature | 1.0 | Qwen team recommended for thinking mode |
| top_p | 0.95 | Standard nucleus sampling |
| top_k | 20 | Focused token selection |
| min_p | 0.0 | No minimum probability filter |
| presence_penalty | 1.5 | Encourages topic diversity |
| repeat_penalty | 1.0 | No repetition penalty |
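These defaults can be overridden per request through the options field of the API body, as the num_ctx entry in the generate example earlier does. The sketch below builds a /api/chat payload that sends only the values being changed, so everything else stays at its Modelfile default. The helper build_chat_payload is hypothetical, and its check against the table's parameter names is a convenience of this sketch, not something Ollama requires.

```python
MODEL = "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU"

# Defaults copied from the parameter table above
MODELFILE_DEFAULTS = {
    "num_ctx": 16384,
    "num_gpu": 99,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repeat_penalty": 1.0,
}


def build_chat_payload(prompt: str, **overrides) -> dict:
    """Return a /api/chat request body carrying only the overridden options."""
    unknown = set(overrides) - set(MODELFILE_DEFAULTS)
    if unknown:
        raise ValueError(f"not in the parameter table: {sorted(unknown)}")
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": dict(overrides),
    }


# Example: shrink the context window and cool the sampling for one request
payload = build_chat_payload("Hello!", num_ctx=8192, temperature=0.7)
print(payload["options"])
```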
| Component | Size |
|---|---|
| Model weights (Q4_K_M) | ~18 GB |
| KV cache (q4_0, 16K context) | ~1.5-2 GB |
| Ollama process overhead | ~0.5-1 GB |
| Total (q4_0 + 16K) | ~20-21 GB ✅ |
| Total (f16 + 16K) | ~25-27 GB ❌ |
⚠️ About 1 GB of baseline VRAM is consumed by the desktop environment (X11/Wayland, browser, etc.). Account for this when checking headroom.
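The two totals in the table can be reproduced with simple addition. The sketch below uses the component sizes listed above and assumes an f16 KV cache is roughly four times the size of a q4_0 one; that ratio is an approximation chosen for this example, not a measured value.

```python
WEIGHTS_GB = 18.0                 # model weights (Q4_K_M)
KV_Q4_0_16K_GB = (1.5, 2.0)       # KV cache at q4_0, 16K context (low, high)
OVERHEAD_GB = (0.5, 1.0)          # Ollama process overhead (low, high)

# Assumed size of each cache type relative to q4_0
KV_SCALE = {"q4_0": 1.0, "f16": 4.0}


def total_vram_gb(kv_type: str) -> tuple:
    """Return the (low, high) VRAM estimate in GB for a 16K context."""
    scale = KV_SCALE[kv_type]
    low = WEIGHTS_GB + KV_Q4_0_16K_GB[0] * scale + OVERHEAD_GB[0]
    high = WEIGHTS_GB + KV_Q4_0_16K_GB[1] * scale + OVERHEAD_GB[1]
    return low, high


for kv_type in KV_SCALE:
    low, high = total_vram_gb(kv_type)
    print(f"{kv_type}: {low:.1f}-{high:.1f} GB")
```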
| Context | KV Cache | Fits 24 GB? | Notes |
|---|---|---|---|
| 8,192 | ~0.8-1 GB | ✅ Plenty of headroom | ~19.5 GB total |
| 16,384 | ~1.5-2 GB | ✅ Recommended | ~21 GB total |
| 32,768 | ~3-4 GB | ⚠️ Tight | ~23 GB total |
| 65,536 | ~6-8 GB | ❌ No | Spills to CPU |
| 131,072 | ~12-16 GB | ❌ No | CPU offload required |
| 262,144 (native) | ~24+ GB | ❌ No | Needs 48 GB+ VRAM |
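The KV cache column grows in proportion to the context length, so the rows above can be approximated by scaling the 16K figures. The sketch below does that and classifies each context size. The 2 GB safety margin (about 1 GB for the desktop plus some slack) and the function names are assumptions made for this example.

```python
WEIGHTS_GB = 18.0
KV_16K_GB = (1.5, 2.0)      # q4_0 KV cache at 16,384 tokens (low, high)
OVERHEAD_GB = (0.5, 1.0)    # Ollama process overhead (low, high)


def estimate_total_gb(num_ctx: int) -> tuple:
    """Scale the 16K KV cache linearly and return (low, high) total GB."""
    factor = num_ctx / 16384
    low = WEIGHTS_GB + KV_16K_GB[0] * factor + OVERHEAD_GB[0]
    high = WEIGHTS_GB + KV_16K_GB[1] * factor + OVERHEAD_GB[1]
    return low, high


def verdict(num_ctx: int, vram_gb: float = 24.0, margin_gb: float = 2.0) -> str:
    """Classify a context size as 'fits', 'tight', or 'no' for a given GPU."""
    _, high = estimate_total_gb(num_ctx)
    if high + margin_gb <= vram_gb:
        return "fits"
    if high <= vram_gb:
        return "tight"
    return "no"


for ctx in (8192, 16384, 32768, 65536):
    low, high = estimate_total_gb(ctx)
    print(f"{ctx}: {low:.2f}-{high:.2f} GB -> {verdict(ctx)}")
```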
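The MoE architecture activates only 8 routed experts (plus 1 shared) out of 256 per token, so each token touches ~3B of the 34.7B parameters. This makes generation fast for a 35B-class model, although all weights must still fit in memory.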
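As a quick worked example, the figures from the properties table give the share of the model that is exercised per token. This is plain arithmetic on the stated numbers and is not a throughput prediction.

```python
TOTAL_PARAMS_B = 34.7    # total parameters, in billions
ACTIVE_PARAMS_B = 3.0    # approximate active parameters per token
ROUTED_EXPERTS = 256
ROUTED_ACTIVE = 8        # routed experts chosen per token

# Fraction of parameters and of routed experts used for each token
param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
expert_fraction = ROUTED_ACTIVE / ROUTED_EXPERTS

print(f"active parameters per token: {param_fraction:.1%}")
print(f"routed experts per token:    {expert_fraction:.1%}")
```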
# Check model is fully GPU-resident
ollama ps
# If CPU% appears, reduce context or switch KV cache type
# Real-time GPU usage (NVIDIA)
watch -n 1 nvidia-smi
# GPU memory query
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
# Ollama server logs (Linux)
journalctl -u ollama -n 50 --no-pager
# Ollama server logs (macOS)
cat ~/.ollama/logs/server.log
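The CSV output of the nvidia-smi memory query above can be checked in a script before loading the model. The sketch below parses that output and reports free memory per GPU. It assumes values are printed in the form "1024 MiB"; the sample text, the helper name free_mib_per_gpu, and the 22 GiB threshold are illustrative.

```python
import csv
import io


def free_mib_per_gpu(csv_text: str) -> dict:
    """Map GPU index to free memory in MiB from nvidia-smi CSV output."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader)  # skip the header row
    free = {}
    for row in reader:
        if not row:
            continue  # skip blank trailing lines
        index, used, total = (cell.strip() for cell in row)
        # Cells look like "1024 MiB"; keep only the number
        free[int(index)] = int(total.split()[0]) - int(used.split()[0])
    return free


# Invented sample in the shape printed by the query above
sample = (
    "index, memory.used [MiB], memory.total [MiB]\n"
    "0, 1024 MiB, 24564 MiB\n"
)
free = free_mib_per_gpu(sample)
enough = {gpu: mib >= 22 * 1024 for gpu, mib in free.items()}
print(free, enough)
```

To use live data, the sample string could be replaced with the output of subprocess.run on the same nvidia-smi command.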
Cause: Not enough free VRAM. Another process (LM Studio, another model, browser, Xorg) is consuming GPU memory.
Fix: Kill competing GPU processes. Free at least 22 GB before loading.
Switch to a smaller quant:
ollama pull oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
# Or try a lighter upstream model:
ollama pull batiai/qwen3.6-35b:q3
ollama ps shows CPU usage for the model. Reduce num_ctx or ensure
OLLAMA_KV_CACHE_TYPE=q4_0 is set.
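The same check can be scripted against the server's /api/ps endpoint instead of reading ollama ps by eye. The sketch below assumes each entry in the response's models list reports size and size_vram in bytes, which is how the endpoint behaves as far as I know; the helper name gpu_fraction and the sample entry are invented for illustration.

```python
def gpu_fraction(model_entry: dict) -> float:
    """Return the share of a loaded model that is resident in VRAM (0.0-1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


# Invented entry shaped like one item of the "models" list from /api/ps
entry = {
    "name": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
    "size": 20_000_000_000,
    "size_vram": 15_000_000_000,
}

share = gpu_fraction(entry)
if share < 1.0:
    print(f"Only {share:.0%} on GPU: reduce num_ctx or check the KV cache type")
else:
    print("Model is fully GPU-resident")
```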
Expected for MoE: only ~3B params are active per token. Typical throughput is ~20-45 tok/s. If generation is significantly slower, check for CPU offloading.
Lower quantizations (q3) may produce malformed JSON. This model uses Q4_K_M, which handles tool calls reliably.
This variant is a text-only GGUF quantization. For vision support, use the
official qwen3.6:35b-a3b model (requires 48 GB+ VRAM or CPU offloading).
Add to ~/.config/opencode/opencode.jsonc:
"oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU": {
"name": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
"options": {
"supportsThinking": false,
"contextWindow": 16384
}
}
Use as agent:
"model": "ollama/oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU"
If you need more VRAM headroom for longer context:
# APEX I-Compact (17 GB): better quality than Q3_K_M
ollama pull fredrezones55/Qwen3.6-35B-A3B-APEX:I-Compact
# Hugging Face GGUF import for IQ4_XS (17.7 GB)
ollama pull hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS
| Resource | URL |
|---|---|
| This model on Ollama | https://ollama.com/oamazonasgabriel/qwen3.6-35b-a3b |
| Upstream model (HuggingFace) | https://huggingface.co/Qwen/Qwen3.6-35B-A3B |
| BatiAI quantized version | https://ollama.com/batiai/qwen3.6-35b |
| Ollama documentation | https://docs.ollama.com |
| Contributor's GitHub | https://github.com/oamazonasgabriel |
| Built by | impacte.tech |