Details

Updated 1 month ago

1 month ago

982c562bad1d · 9.2GB ·

model

archgemma4

parameters19.9B

quantizationIQ3_XS

9.2GB

params

{ "num_ctx": 256000, "repeat_last_n": 256, "repeat_penalty": 1.15, "stop": [

150B

template

{{- if or .System .Tools }}<bos><|turn>system {{ if .System }}{{ .System }} {{ end }}{{- if .Tools }

1.3kB

Gemma 4 26B-A4B 98e v4 — multi-class CD-map expert prune

20.8B parameters · 98 experts (30 dropped) · multi-class ContribDynamic drop map

Research checkpoint that takes Gemma-4-26B-A4B-it and drops 30 of 128 experts per layer using a multi-class CD-map (max-over-normalized-classes) recipe. Five task classes — math, logic, code, science, creative — are scored per layer; an expert is kept if it scores high on any single class, rescuing specialists that a pooled single-class score would have dropped. Same router, same attention, same norms as base — only the expert keep-set changes.

Full model card, methodology, drop-map rationale, ablations: ManniX-ITA/gemma-4-A4B-98e-v4-it on Hugging Face.

Other formats

Format	Repo	Notes
GGUF (this repo, llama.cpp / ollama)	`ManniX-ITA/gemma-4-A4B-98e-v4-it-GGUF`	Bartowski tier sweep (Q2_K → Q8_0, IQ-series) + 5 ContribDynamic CD-* per-layer quants. F16 baseline included.
NVFP4A16 (vLLM)	`ManniX-ITA/gemma-4-A4B-98e-v4-NVFP4A16`	~13 GB, native vLLM, produced via `modelopt==0.43.0`.
BF16 source weights	`ManniX-ITA/gemma-4-A4B-98e-v4-it`	20.8B bf16; base for any further surgery / quant.

When to use this vs. v5-coder

Pick v4 for: general-purpose chat, knowledge recall, instruction-following, mixed workloads. Balanced across the 5 task classes by construction.

Pick v5-coder for: Python / JS / Rust code generation, HumanEval / LCB workloads, MATH-500-class problems. Same parameter count, code-targeted drop map (C6 layer-relevance-weighted v4-floor, breadth=50) — wins on every code bench and on MATH-500.

Quick start

# recommended default for most setups (≈14 GB VRAM)
ollama pull mannix/gemma4-98e-v4:Q4_K_M

# best quality at moderate size (≈17 GB VRAM)
ollama pull mannix/gemma4-98e-v4:Q6_K

# size-conscious (≈8 GB VRAM) — minimal quality loss
ollama pull mannix/gemma4-98e-v4:CD-Q3_K_M

CD-* variants are ContribDynamic per-layer mixed quants — expert layers get more bits, attention/norm less. Roughly 5–10% faster than the matching plain quant at similar quality.

Scores

NVFP4A16, vLLM, greedy decoding, thinking-token budget 12 288. Apples-to-apples against the 128e reference (Gemma-4-26B-A4B-it) on the same harness.

Benchmark (n)	128e ref	98e v4	Δ (v4 − 128e)
HumanEval-164 chat (pass@1)	96.95	96.95	0.00
HumanEval+-164 chat (pass@1)	92.07	91.46	−0.61
MATH-500-100 (math_verify)	89.00	89.00	0.00
AIME 2024 (30)	36.67	36.67	0.00
IFEval-100 (prompt_strict)	95.00	93.00	−2.00
GSM8K-100 (flex)	91.00	86.00	−5.00
GPQA Diamond (198, flex)	73.23	69.19	−4.04
LCB-medium-55 v4 (pass@1)	87.27	78.18	−9.09

Reading the deltas: code chat (HE/HE+) and pure math (MATH-500, AIME) hold at 128e parity. Knowledge / multi-step reasoning (GPQA, GSM8K) and competitive code (LCB-medium) take a measured hit — the cost of dropping 23.4% of experts with a class-balanced drop map. Use v5-coder if those code/math axes matter for your workload.

Template & parameters

Uses the Gemma 4 chat template with tool-use support and a 2nd-turn workaround for nested function calls. Default parameters baked into every tag:

PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.15
PARAMETER repeat_last_n 256
PARAMETER num_ctx 256000
PARAMETER stop <turn|>
PARAMETER stop <|tool_response>

License

Gemma Terms of Use. Use of this model implies acceptance.

v5-coder (code-axis drop map, code/math-leaning): mannix/gemma4-98e-v5-coder
v3 (pooled teacher-force drop map, earlier variant): mannix/gemma4-98e
Project / scripts: github.com/mann1x/omnimergekit

Pruned to 98 experts gemma-4 a4b 26b v4