
APEX (Adaptive Precision for EXpert Models) quantizations of Qwen/Qwen3.6-35B-A3B, aiming to optimize the quality/size trade-off.

Capabilities: vision · tools · thinking
ollama run fredrezones55/Qwen3.6-35B-A3B-APEX:Compact

Details

Updated 3 weeks ago · 49a41057c900 · 18GB · qwen35moe · 35.1B · Q4_K_M

Readme

Ollama patched merge model brought to you by fredrezone55:

Qwen 3.6 35B-A3B APEX

APEX (Adaptive Precision for EXpert Models) quantizations of Qwen/Qwen3.6-35B-A3B.

Brought to you by the LocalAI team | APEX Project | Technical Report

Benchmark Results

All benchmarks run with llama.cpp b8797 on NVIDIA GB10 (122 GB VRAM). Perplexity and KL divergence measured on wikitext-2. HellaSwag zero-shot (400 tasks). KL divergence computed against BF16 reference logits.
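The per-token KL divergence against BF16 reference logits can be computed along these lines. This is a minimal pure-Python sketch, not llama.cpp's actual implementation, which operates over full vocabulary logits saved from the reference run:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position,
    where P_ref comes from the BF16 model and P_quant from the quant.
    Zero when the quantized model reproduces the reference distribution."""
    p = softmax(ref_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The "KL mean / median / max" columns below aggregate these per-token values over the wikitext-2 evaluation set; KL max captures the worst-case token.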

APEX vs Baselines (unsloth UD quants)

Model             Size   PPL ↓   KL mean ↓   KL median ↓   KL max ↓   HellaSwag ↑
BF16 (reference)  65 GB  6.722   –           –             –          –
Q8_0              35 GB  6.720   0.0059      0.0022        9.72       82.5%
UD-Q5_K_XL        25 GB  6.725   0.0083      0.0030        9.06       82.8%
UD-Q5_K_S         24 GB  6.728   0.0095      0.0035        8.72       82.8%
APEX I-Balanced   24 GB  6.727   0.0103      0.0041        4.53       83.0%
APEX Balanced     24 GB  6.726   0.0117      0.0047        14.14      83.0%
APEX I-Quality    22 GB  6.735   0.0141      0.0054        5.69       82.5%
APEX Quality      22 GB  6.753   0.0155      0.0060        13.01      82.8%
UD-Q4_K_XL        21 GB  6.735   0.0134      0.0050        5.14       82.3%
UD-Q4_K_M         21 GB  6.736   0.0138      0.0054        7.86       83.3%
APEX I-Compact    17 GB  6.857   0.0451      0.0182        8.76       83.5%
APEX Compact      17 GB  6.862   0.0614      0.0261        17.58      83.3%
UD-Q3_K_M         16 GB  6.883   0.0435      0.0163        9.37       82.8%
APEX I-Mini       14 GB  7.238   0.0999      0.0414        9.21       82.8%

[Charts: Complete Benchmark Summary · KL Max Comparison · APEX vs Baselines]

Highlights

  • APEX I-Balanced (24 GB) achieves the lowest KL max (4.53) of any quant tested — even lower than Q8_0 (9.72). The imatrix dramatically reduces worst-case divergence while matching UD-Q5_K_S on perplexity.
  • At 17 GB, APEX I-Compact beats UD-Q3_K_M (16 GB) on PPL (6.857 vs 6.883) and HellaSwag (83.5% vs 82.8%).
  • imatrix calibration consistently cuts KL max by more than half: I-Balanced 4.53 vs Balanced 14.14, I-Quality 5.69 vs Quality 13.01.
  • APEX I-Mini (14 GB) delivers usable quality (PPL 7.24, HellaSwag 82.8%) in the smallest package.

Available Files

File                                   Profile            Size    Best For
Qwen3.6-35B-A3B-APEX-I-Balanced.gguf   I-Balanced         24 GB   Best overall — lowest KL max of any quant
Qwen3.6-35B-A3B-APEX-I-Quality.gguf    I-Quality          22 GB   Highest quality with imatrix, 2 GB smaller
Qwen3.6-35B-A3B-APEX-Quality.gguf      Quality            22 GB   Highest quality standard
Qwen3.6-35B-A3B-APEX-Balanced.gguf     Balanced           24 GB   General purpose
Qwen3.6-35B-A3B-APEX-I-Compact.gguf    I-Compact          17 GB   Consumer GPUs, beats UD-Q3_K_M quality
Qwen3.6-35B-A3B-APEX-Compact.gguf      Compact            17 GB   Consumer GPUs
Qwen3.6-35B-A3B-APEX-I-Mini.gguf       I-Mini             14 GB   Smallest viable, fastest inference
mmproj.gguf                            Vision projector   ~1 GB   Required for image understanding

What is APEX?

APEX is a quantization strategy for Mixture-of-Experts (MoE) models. It classifies tensors by role (routed expert, shared expert, attention) and applies a layer-wise precision gradient — edge layers get higher precision, middle layers get more aggressive compression. I-variants use diverse imatrix calibration (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).

The key insight: in MoE models, expert FFN tensors make up the bulk of the model's weights, but only ~8 of 256 experts activate per token. APEX compresses middle-layer experts more aggressively while preserving edge layers (first/last 5) and keeping attention, SSM/Mamba, and shared-expert tensors at higher precision.
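As a rough sketch, that role-plus-position rule could look like the following. The function name, tensor-name matching, and specific quant-type choices here are hypothetical; the real selection logic lives in the APEX project scripts:

```python
# Sketch of APEX-style per-tensor quant selection (hypothetical names and
# quant types; see the APEX project for the actual scripts).
EDGE = 5       # first/last layers kept at higher precision ("5+5 symmetric")
N_LAYERS = 40

def quant_for(tensor_name: str, layer: int) -> str:
    """Pick a quant type from the tensor's role and layer position."""
    is_edge = layer < EDGE or layer >= N_LAYERS - EDGE
    is_routed_expert = ("ffn" in tensor_name and "exps" in tensor_name
                        and "shexp" not in tensor_name)
    if is_routed_expert:
        # Routed-expert FFN weights are the bulk of the model: compress the
        # middle layers hardest, keep the edge layers at higher precision.
        return "Q5_K" if is_edge else "Q3_K"
    # Attention, SSM/Mamba, and shared-expert tensors stay at higher precision.
    return "Q6_K"
```

The different APEX profiles (Quality, Balanced, Compact, Mini) would then correspond to different quant-type choices plugged into the same rule.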

See the APEX project for full details, technical report, and scripts.

Architecture

  • Model: Qwen 3.6 35B-A3B (Qwen/Qwen3.6-35B-A3B)
  • Layers: 40
  • Experts: 256 routed + shared (8 active per token)
  • Total Parameters: ~35B
  • Active Parameters: ~3B per token
  • Attention: Hybrid (full attention every 4th layer, linear/Mamba otherwise)
  • Vision: Built-in vision encoder (mmproj included)
  • APEX Config: 5+5 symmetric edge gradient across 40 layers
  • Calibration: v1.3 diverse dataset (chat, code, reasoning, multilingual, tool-calling, Wikipedia)
  • llama.cpp: Built with b8797
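The hybrid attention layout above can be illustrated with a quick sketch. Only the 1-in-4 ratio is stated by the card; the exact offset of the full-attention layers (indices 3, 7, 11, ...) is an assumption:

```python
# Hybrid attention layout: full attention every 4th layer, linear/Mamba
# otherwise. The offset is assumed; only the 1-in-4 ratio is documented.
N_LAYERS = 40

full_attn = [i for i in range(N_LAYERS) if i % 4 == 3]
linear_attn = [i for i in range(N_LAYERS) if i % 4 != 3]

# 40 layers split into 10 full-attention and 30 linear/Mamba layers.
```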

Run with LocalAI

local-ai run mudler/Qwen3.6-35B-A3B-APEX-GGUF@Qwen3.6-35B-A3B-APEX-I-Balanced.gguf

Credits

APEX is brought to you by the LocalAI team. Developed through human-driven, AI-assisted research. Built on llama.cpp.