A lightweight variant of Qwen3.6-35B-A3B using Q4_K_M quantization, designed to fit within 24 GB total VRAM with a 16K context window.

tools thinking
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

Applications

Claude Code
ollama launch claude --model oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
Codex App
ollama launch codex-app --model oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
OpenClaw
ollama launch openclaw --model oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
Hermes Agent
ollama launch hermes --model oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
Codex
ollama launch codex --model oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
OpenCode
ollama launch opencode --model oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

Readme

Qwen3.6-35B-A3B: Ollama Model (24 GB)

Model: oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU


DESCRIPTION

A memory-efficient variant of Qwen3.6-35B-A3B using imatrix-calibrated Q4_K_M quantization. Designed to run on hardware with 24 GB of GPU-accessible memory (a single GPU with 24 GB VRAM, or Apple Silicon with 24 GB unified memory), with headroom for a 16K context window.

Key Features

  • Mixture-of-Experts (MoE): 34.7B total parameters, only ~3B active per token
  • Fast Generation: ~20-45 tok/s on compatible GPU hardware
  • Tool Calling: Native support via qwen3_coder parser
  • Thinking Mode: Enabled by default
  • Long Context: 262K native (16K configured for VRAM budget)

Architecture

| Property | Value |
| --- | --- |
| Architecture | Mixture-of-Experts (MoE) |
| Total Parameters | 34.7B |
| Active Parameters | ~3B (per token) |
| Experts | 256 (8 routed + 1 shared) |
| Layers | 40 (hybrid Gated DeltaNet + Gated Attention + MoE) |
| Native Context | 262,144 tokens |
| Extended Context | up to 1,010,000 via YaRN |
| Modalities | Text (this variant); upstream supports Image + Video |
| Quantization | Q4_K_M / IQ4_XS imatrix-calibrated |
| Model Size | ~18 GB (weights) |
| License | Apache 2.0 |
| Upstream | Qwen/Qwen3.6-35B-A3B |

REQUIREMENTS

| Resource | Minimum | Recommended |
| --- | --- | --- |
| GPU Memory | 24 GB VRAM | 24 GB+ VRAM |
| System RAM | 32 GB | 48 GB |
| Disk Space | 20 GB free | 50 GB+ free |
| NVIDIA Driver | 525+ | 550+ |
| Ollama Version | 0.30.6+ | Latest |

Platform support:

  • NVIDIA GPU: single GPU with 24 GB+ VRAM (e.g., RTX 3090, RTX 4090, RTX A5000)
  • Apple Silicon: Mac with 24 GB+ unified memory (e.g., M2 Max, M3 Max, M4 Max)
  • AMD GPU: ROCm-compatible with 24 GB+ VRAM

💡 Why 24 GB? The model weights are ~18 GB at Q4_K_M. With a q4_0 KV cache and 16K context, total VRAM usage including Ollama overhead is ~20-21 GB. The remaining headroom covers the desktop environment and other system processes.


QUICK START

1. Install Ollama

# macOS (Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

2. Set Environment Variables (Recommended)

Set these before starting Ollama to minimize memory usage:

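# KV cache at q4_0: roughly a quarter of the memory of the default f16
# (Ollama generally applies KV cache quantization only when flash attention is enabled)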
export OLLAMA_KV_CACHE_TYPE=q4_0

# Limit to one request at a time (memory constraint)
export OLLAMA_NUM_PARALLEL=1
export OLLAMA_MAX_LOADED_MODELS=1

Permanent setup (Linux systemd):

sudo systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q4_0"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_LOADED_MODELS=1"

Then:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Permanent setup (macOS launchd):

launchctl setenv OLLAMA_KV_CACHE_TYPE q4_0
launchctl setenv OLLAMA_NUM_PARALLEL 1
launchctl setenv OLLAMA_MAX_LOADED_MODELS 1
# Quit and restart Ollama from Applications
# (launchctl setenv values are lost on reboot; re-run these commands after restarting the Mac)

3. Pull & Run

# Pull the model (downloads ~18 GB)
ollama pull oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

# Run interactively
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

# Single prompt
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU "Explain quantum computing in simple terms"

USAGE

CLI

# Interactive chat
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

# With a system prompt: start a session, then set it at the >>> prompt
ollama run oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
>>> /set system You are a helpful coding assistant

REST API

# Chat completion
curl -s http://127.0.0.1:11434/api/chat \
  -d '{
    "model": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Generate (completion)
curl -s http://127.0.0.1:11434/api/generate \
  -d '{
    "model": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
    "prompt": "Why is the sky blue?",
    "stream": false,
    "options": { "num_ctx": 1024 }
  }'

# OpenAI-compatible endpoint
curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
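
The chat example above omits "stream", so /api/chat answers with newline-delimited JSON chunks rather than one object. Below is a minimal sketch of how a client might reassemble them using only the Python standard library. The helper name collect_stream and the sample lines are illustrative assumptions, not part of Ollama.

```python
import json

def collect_stream(lines):
    """Join the message.content pieces from an /api/chat NDJSON stream."""
    parts = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk is marked done=true
    return "".join(parts)

# Illustrative sample, shaped like the chunks the server emits
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo!"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample))
```

Setting "stream": false, as the other two examples do, avoids this step at the cost of waiting for the full reply.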

Python (ollama library)

pip install ollama

import ollama

# Chat
response = ollama.chat(
    model='oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)

# Generate
response = ollama.generate(
    model='oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU',
    prompt='Write a Python function to sort a list',
)
print(response.response)
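
Since tool calling is listed as a key feature, a short sketch of the request and dispatch side may help. It uses plain dictionaries in the shape of the /api/chat JSON body and response. The get_weather tool, the TOOLS registry, and the sample assistant message are hypothetical and exist only for illustration.

```python
import json

MODEL = "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU"

def get_weather(city):
    """Hypothetical tool: a real one would query a weather service."""
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def build_request(prompt):
    """Build an /api/chat body that advertises get_weather to the model."""
    return {
        "model": MODEL,
        "stream": False,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

def run_tool_calls(message):
    """Execute each tool call in an assistant message; return tool-role replies."""
    replies = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        handler = TOOLS.get(fn["name"])
        if handler is None:
            continue  # the model asked for a tool we do not provide
        result = handler(**fn["arguments"])
        replies.append({"role": "tool", "content": str(result)})
    return replies

# Illustrative assistant message, shaped like a tool-call response
example = {"role": "assistant", "content": "",
           "tool_calls": [{"function": {"name": "get_weather",
                                        "arguments": {"city": "Lisbon"}}}]}
print(json.dumps(run_tool_calls(example)))
```

In a real loop, the replies would be appended to the message list along with the assistant message, and the request sent again so the model can produce its final answer.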

JavaScript (ollama.js)

npm install ollama

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU',
  messages: [{ role: 'user', content: 'Hello!' }],
})
console.log(response.message.content)

MODEL DETAILS

Sampling Parameters

These are baked into the model via its Modelfile:

| Parameter | Value | Rationale |
| --- | --- | --- |
| num_ctx | 16384 | Safe ceiling with q4_0 KV cache on 24 GB |
| num_gpu | 99 | Offload all layers to GPU |
| temperature | 1.0 | Qwen team recommended for thinking mode |
| top_p | 0.95 | Standard nucleus sampling |
| top_k | 20 | Focused token selection |
| min_p | 0.0 | No minimum probability filter |
| presence_penalty | 1.5 | Encourages topic diversity |
| repeat_penalty | 1.0 | No repetition penalty |
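
The values above are defaults, and a request can override them per call through the "options" field, as the Generate example does with num_ctx. The sketch below builds such an override and sends only the keys that differ from the Modelfile. The helper name request_options and the chosen override values are illustrative assumptions.

```python
import json

# Defaults copied from the sampling parameters listed above
MODELFILE_DEFAULTS = {
    "num_ctx": 16384,
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repeat_penalty": 1.0,
}

def request_options(**overrides):
    """Return only the options that differ from the baked-in defaults."""
    unknown = set(overrides) - set(MODELFILE_DEFAULTS)
    if unknown:
        raise ValueError(f"unexpected option(s): {sorted(unknown)}")
    return {k: v for k, v in overrides.items() if MODELFILE_DEFAULTS[k] != v}

body = {
    "model": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
    "prompt": "Why is the sky blue?",
    "stream": False,
    # top_k=20 matches the default, so it is dropped from the payload
    "options": request_options(temperature=0.6, num_ctx=8192, top_k=20),
}
print(json.dumps(body["options"]))
```

Raising num_ctx above 16384 this way also raises KV cache memory, so the VRAM budget still applies.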

MEMORY & PERFORMANCE

VRAM Budget

| Component | Size |
| --- | --- |
| Model weights (Q4_K_M) | ~18 GB |
| KV cache (q4_0, 16K context) | ~1.5-2 GB |
| Ollama process overhead | ~0.5-1 GB |
| Total (q4_0 + 16K) | ~20-21 GB ✅ |
| Total (f16 + 16K) | ~25-27 GB ❌ |

⚠️ About 1 GB of baseline VRAM is consumed by the desktop environment (X11/Wayland, browser, etc.). Account for this when checking headroom.
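
The budget above is simple addition of ranges. The sketch below reproduces it and shows the headroom left on a 24 GB card once the ~1 GB desktop baseline is included. All figures are the document's own estimates, and the variable names are illustrative.

```python
# (low, high) estimates in GB, taken from the VRAM budget above
WEIGHTS = (18.0, 18.0)
KV_Q4_16K = (1.5, 2.0)
OVERHEAD = (0.5, 1.0)
DESKTOP = (1.0, 1.0)   # baseline use by the desktop environment
CARD_GB = 24.0

def total(*components):
    """Sum (low, high) ranges component-wise."""
    return (sum(c[0] for c in components), sum(c[1] for c in components))

low, high = total(WEIGHTS, KV_Q4_16K, OVERHEAD)
print(f"model total: {low:.1f}-{high:.1f} GB")

low_d, high_d = total(WEIGHTS, KV_Q4_16K, OVERHEAD, DESKTOP)
print(f"with desktop: {low_d:.1f}-{high_d:.1f} GB, "
      f"headroom {CARD_GB - high_d:.1f}-{CARD_GB - low_d:.1f} GB")
```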

Context Window Scaling

| Context | KV Cache | Fits 24 GB? | Notes |
| --- | --- | --- | --- |
| 8,192 | ~0.8-1 GB | ✅ Plenty of headroom | ~19.5 GB total |
| 16,384 | ~1.5-2 GB | ✅ Recommended | ~21 GB total |
| 32,768 | ~3-4 GB | ⚠️ Tight | ~23 GB total |
| 65,536 | ~6-8 GB | ❌ No | Spills to CPU |
| 131,072 | ~12-16 GB | ❌ No | CPU offload required |
| 262,144 (native) | ~24+ GB | ❌ No | Needs 48 GB+ VRAM |
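
The KV cache figures above roughly double with each doubling of context. The sketch below assumes that linear relationship, which is implied by these numbers and is not a measured property of the model, and scales from the 16K baseline to estimate whether a given num_ctx fits. The function names are illustrative.

```python
BASE_CTX = 16384
BASE_KV_GB = (1.5, 2.0)   # q4_0 KV cache at 16K context
FIXED_GB = (18.5, 19.0)   # weights plus Ollama overhead, from the VRAM budget

def kv_cache_gb(num_ctx):
    """Estimate the (low, high) KV cache size, assuming linear scaling."""
    scale = num_ctx / BASE_CTX
    return (BASE_KV_GB[0] * scale, BASE_KV_GB[1] * scale)

def fits(num_ctx, vram_gb=24.0):
    """True if the pessimistic (high) estimate stays within vram_gb."""
    return FIXED_GB[1] + kv_cache_gb(num_ctx)[1] <= vram_gb

for ctx in (8192, 16384, 32768, 65536):
    lo, hi = kv_cache_gb(ctx)
    print(f"{ctx:>6}: KV {lo:.2f}-{hi:.2f} GB, fits 24 GB: {fits(ctx)}")
```

The check ignores the desktop baseline, which is why 32,768 passes here but is still rated as tight.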

Performance

  • Prompt processing: ~15-25 tok/s (depends on prompt length)
  • Text generation: ~20-45 tok/s (only ~3B active params per token)
  • Model load time: ~50 seconds (~18 GB load)
  • First token latency: Higher on cold start, faster once cached

With the MoE architecture, only 8 routed experts plus 1 shared expert, out of 256, are activated per token, which makes generation fast for a 35B-parameter model.
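
As a rough illustration of that sparsity, the arithmetic below computes the share of parameters and routed experts used per token from the figures in the architecture table. Note that all weights still have to be resident in memory, so the saving is in compute per token, not in VRAM.

```python
TOTAL_PARAMS_B = 34.7
ACTIVE_PARAMS_B = 3.0     # approximate, per the architecture table
TOTAL_EXPERTS = 256
ROUTED_PER_TOKEN = 8      # the shared expert is not counted here

param_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
expert_share = ROUTED_PER_TOKEN / TOTAL_EXPERTS

print(f"active parameters per token: {param_share:.1%}")
print(f"routed experts per token:    {expert_share:.1%}")
```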


MONITORING

# Check model is fully GPU-resident
ollama ps
# If CPU% appears, reduce context or switch KV cache type

# Real-time GPU usage (NVIDIA)
watch -n 1 nvidia-smi

# GPU memory query
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

# Ollama server logs (Linux)
journalctl -u ollama -n 50 --no-pager

# Ollama server logs (macOS)
# Check Console.app or ~/.ollama/logs/server.log

TROUBLESHOOTING

Model fails to load: "unable to allocate buffer"

Cause: Not enough free VRAM. Another process (LM Studio, another model, browser, Xorg) is consuming GPU memory.
Fix: Kill competing GPU processes. Free at least 22 GB before loading.
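
To check the 22 GB threshold before loading, the CSV output of the nvidia-smi memory query from the Monitoring section can be parsed. This is a minimal sketch. The sample text stands in for real command output, and treating 22 GB as 22 × 1024 MiB is an assumption.

```python
import csv
import io

REQUIRED_FREE_MIB = 22 * 1024   # "at least 22 GB", read here as GiB (assumption)

def free_mib_per_gpu(csv_text):
    """Parse `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv`."""
    rows = list(csv.reader(io.StringIO(csv_text.strip()), skipinitialspace=True))
    free = {}
    for index, used, total in rows[1:]:      # rows[0] is the header
        used_mib = int(used.split()[0])      # "1800 MiB" -> 1800
        total_mib = int(total.split()[0])
        free[int(index)] = total_mib - used_mib
    return free

# Illustrative output; real values come from running the command
sample = """index, memory.used [MiB], memory.total [MiB]
0, 1800 MiB, 24564 MiB
"""
for gpu, mib in free_mib_per_gpu(sample).items():
    status = "ok" if mib >= REQUIRED_FREE_MIB else "too little free VRAM"
    print(f"GPU {gpu}: {mib} MiB free ({status})")
```

On a live system, the text could come from subprocess.run([...], capture_output=True, text=True).stdout with the same nvidia-smi arguments.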

OOM / Model does not load

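Re-pull this model to confirm the download is complete, or switch to a smaller quant:

# Re-pull this model (same Q4_K_M weights, so memory use is unchanged)
ollama pull oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
# Or try a lighter upstream model:
ollama pull batiai/qwen3.6-35b:q3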

CPU offloading detected

If ollama ps shows a CPU share for the model, part of it has been offloaded to system RAM. Reduce num_ctx or ensure OLLAMA_KV_CACHE_TYPE=q4_0 is set.
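
The same check can be scripted against the server's /api/ps endpoint, which reports each loaded model's total size and the portion held in VRAM. The sketch below is one way to do that with the standard library. The helper names and the sample entry are illustrative, and the default host is the one used in the REST examples.

```python
import json
import urllib.request

def gpu_share(model_entry):
    """Fraction of a loaded model's memory that sits in VRAM (1.0 = fully on GPU)."""
    size = model_entry.get("size", 0)
    return model_entry.get("size_vram", 0) / size if size else 0.0

def loaded_models(host="http://127.0.0.1:11434"):
    """Fetch the running-model list from the server's /api/ps endpoint."""
    with urllib.request.urlopen(f"{host}/api/ps") as resp:
        return json.load(resp).get("models", [])

# Illustrative entry; call loaded_models() against a running server instead
sample = {"name": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
          "size": 20_000_000_000, "size_vram": 15_000_000_000}
print(f"{sample['name']}: {gpu_share(sample):.0%} on GPU")
```

A share below 100% corresponds to the CPU offloading case described here.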

Slow generation

With only ~3B parameters active per token, typical throughput is ~20-45 tok/s. If generation is significantly slower, check for CPU offloading.

Tool calling fails

Lower quantizations (q3) may produce malformed JSON in tool calls. This model uses Q4_K_M, which handles tool calls more reliably.

Vision not supported

This variant is a text-only GGUF quantization. For vision support, use the official qwen3.6:35b-a3b model (requires 48 GB+ VRAM or CPU offloading).


OPENCODE INTEGRATION

Add to ~/.config/opencode/opencode.jsonc:

"oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU": {
  "name": "oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU",
  "options": {
    "supportsThinking": false,
    "contextWindow": 16384
  }
}

Use as agent:

"model": "ollama/oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU"

ALTERNATIVE MODELS

If you need more VRAM headroom for longer context:

# APEX I-Compact (17 GB): better quality than Q3_K_M
ollama pull fredrezones55/Qwen3.6-35B-A3B-APEX:I-Compact

# Hugging Face GGUF import for IQ4_XS (17.7 GB)
ollama pull hf.co/bartowski/Qwen_Qwen3.6-35B-A3B-GGUF:IQ4_XS

LINKS

| Resource | URL |
| --- | --- |
| This model on Ollama | https://ollama.com/oamazonasgabriel/qwen3.6-35b-a3b |
| Upstream model (HuggingFace) | https://huggingface.co/Qwen/Qwen3.6-35B-A3B |
| BatAI quantized version | https://ollama.com/batiai/qwen3.6-35b |
| Ollama documentation | https://docs.ollama.com |
| Contributor's GitHub | https://github.com/oamazonasgabriel |
Built by impacte.tech