jthomas/coe-gemma4-math-hc-14b-a4b:q4_K_M


# Gemma4 College of Experts — Math Specialist

**Base model:** [google/gemma-4-26b-it](https://huggingface.co/google/gemma-4-26b-it)

**Architecture:** MoE — 26B total / ≈4B active parameters (1 shared expert + 8 routed from a pool of 128 per MoE layer, 30 MoE layers)

**Method:** Activation-directed expert surgery — 128 → 64 experts per layer (50% reduction)

**Quantization:** Q4_K_M (≈9.7 GB on disk)

**HF:** `JThomas-CoE/coe-gemma4-math-hc-14b-a4b-q4` | **Ollama:** `coe-gemma4-math-14b-a4b:q4`

**Hub:** [JThomas-CoE on HuggingFace](https://huggingface.co/JThomas-CoE)

> **HC (Hand-Curated)** — The activation profiling corpus used to build this model's expert mask was assembled by hand, selecting high-quality domain-representative text from textbooks, problem sets, and technical references. This contrasts with the MMLU-Pro variants (`-mmlu_pro-` in the repo name) which were profiled on stratified subsets of the TIGER-Lab/MMLU-Pro benchmark dataset.

---

## ⚠️ Beta Release — Safety Disclaimer

**These models are beta releases and should be treated as research artifacts, not production-ready systems.**

Expert surgery selects and retains domain-relevant experts based on activation patterns observed during profiling. The pruning pipeline is designed solely to create a coherent domain specialist — it has no mechanism to identify which experts contribute to model alignment, ethical reasoning, or safety guardrails. As a result, experts responsible for enforcing those behaviours may have been inadvertently removed during the surgery process.

**Appropriate use of any model in the College of Experts family is the sole responsibility of the end user.** The authors make no representation that these models retain the safety properties of the parent `google/gemma-4-26b-it` model, and users should not rely on them as a substitute for models that have undergone safety evaluation.

---

## ⚠️ Critical Usage Note — Think-Off Mode

**All models in this series must be used in thinking-off mode.**

If you are using the Ollama API, pass `"think": false` in your request body. If you are accessing the model via a raw API (llama.cpp server, OpenAI-compatible endpoint, etc.) you **must inject a closed thinking block** at the start of the assistant turn:

```python
messages = [
    {"role": "system", "content": "Your system prompt here."},
    {"role": "user", "content": "Your question here."},
    # Required prefill: an already-closed think block opens the assistant turn.
    {"role": "assistant", "content": "<think></think>\n"},
]
```

**Why this is required:** Expert surgery retains 50% of the expert pool per layer, selecting experts that are maximally active on domain content and chain-of-thought reasoning. A side effect is that the loop-suppression experts — which activate on metacognitive closure signals near the end of a `<think>` block — do not have a concentrated domain-specific activation signature and are disproportionately pruned. In think-on mode, this causes the model to enter a reasoning loop that exhausts the token budget without producing a final answer. In extreme cases, the loop rate is 60–70% on hard questions.

The `<think></think>` prefill works by consuming the opening `<think>` token before generation starts, so the model sees its thinking as already complete and proceeds directly to answering. This is the mechanism used in all benchmarks reported here.

**What think-off mode does not disable:** Gemma4's chain-of-thought training is deeply ingrained. Even with the think block closed, the model produces brief inline reasoning interleaved with its answer — shorter and more linear than a full scratchpad, but present. All benchmark figures in this README are measured in this constrained-implicit-CoT mode, which is more conservative than the full explicit CoT used by leaderboard entries.
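For the Ollama route, the request body can be sketched as follows. This is a minimal sketch, assuming a local Ollama server on its default port and the `/api/chat` endpoint; the helper names `build_chat_request` and `send_chat` are illustrative, and the model tag is the one listed in this README and may differ on your machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumption: default local endpoint
MODEL_TAG = "coe-gemma4-math-14b-a4b:q4"        # tag as listed in this README


def build_chat_request(question, system="You are an expert mathematician."):
    """Build an Ollama chat payload with thinking disabled."""
    return {
        "model": MODEL_TAG,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        "think": False,   # think-off mode, as required for this series
        "stream": False,  # ask for a single JSON object rather than a stream
        "options": {"temperature": 0.6},
    }


def send_chat(payload):
    """POST the payload and return the assistant text (needs a running server)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]


payload = build_chat_request("Factor x^2 - 5x + 6.")
print(json.dumps(payload, indent=2))
```

Only `build_chat_request` runs here; call `send_chat(payload)` once a server with the model is available.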

### Ollama Modelfile Template

```
FROM <model_path_or_ollama_tag>
PARAMETER temperature 0.6
PARAMETER repeat_penalty 1.05
PARAMETER num_ctx 8192
PARAMETER num_predict 16384
PARAMETER think false
SYSTEM """
You are an expert mathematician.
Show all working. State the approach before computing. Present final answers clearly labelled.
Return your complete answer, then stop with no further output.
"""
```

Temperature 0.6 is strongly recommended. Higher temperatures (≥ 0.8) materially increase loop rates in think-off mode and reduce numerical precision on applied problems.

---

## What Are These Models?

These models are produced by **activation-directed expert surgery** applied to the Gemma4 26B-A4B instruction-tuned model. The surgery does not change any weight values: it prunes the FFN weight tensors for experts that are not part of the domain-specialist mask, then saves the result as a smaller GGUF. The resulting model occupies ≈7.1 GB less VRAM than the parent at 4-bit quantization while maintaining the same token throughput (active parameters per forward pass are unchanged: 9 experts fire per token regardless of pool size).

### Memory Efficiency

| Model | VRAM (16k ctx) | VRAM (64k ctx) | Active params |
|---|---|---|---|
| Gemma4-26B parent (Q4_K_M) | 19.4 GB | 20.5 GB | ≈4B |
| Specialist K=64 (Q4_K_M) | **12.3 GB** | **13.4 GB** | ≈4B |
| Q4 savings vs Q4 parent | **7.1 GB (37%)** | **7.1 GB (35%)** | unchanged |

All figures directly measured in Ollama.

Throughput (tokens/second) is identical between the specialist and the parent at the same quantization because the number of expert weight tensors that participate in each forward pass is the same. The saving is purely in VRAM residency — half the expert weight tensors simply do not need to be loaded.
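The savings row can be re-derived from the measured residency figures. The short check below uses only the VRAM numbers quoted in this README for the 16k and 64k context settings; the variable names are illustrative.

```python
# VRAM residency in GB at the two context sizes quoted in this README.
parent = {"16k": 19.4, "64k": 20.5}
specialist = {"16k": 12.3, "64k": 13.4}

savings = {}
for ctx in parent:
    saved_gb = parent[ctx] - specialist[ctx]
    # Store (GB saved, percent of parent residency), rounded as in the table.
    savings[ctx] = (round(saved_gb, 1), round(100 * saved_gb / parent[ctx]))
    print(f"{ctx} ctx: {savings[ctx][0]} GB saved ({savings[ctx][1]}% of parent)")
```

The absolute saving is the same at both context sizes because only the expert weights shrink; the KV cache grows with context for parent and specialist alike.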

---

## Activation Profiling — How the Masks Are Built

### Step 1 — Corpus Assembly

A domain-representative text corpus is assembled: textbook prose, problem-set Q&A pairs, structured reference material, and (for technical domains) model-generated think-on traces on domain questions. Corpus size: approximately 76,000 tokens across the blended math_standard, math_advanced, and math_supplement sub-corpora.

**Corpus size considerations.** Choosing how much material to include for activation profiling involves two competing pressures. On one side, a corpus that is too small or too narrow may fail to activate the full set of experts that are genuinely relevant to the domain: rare but important concepts may appear in too few tokens to accumulate statistically reliable activation counts, leaving their associated experts underweighted or excluded from the mask. On the other side, a corpus that grows too large — particularly if expansion is driven by including only tangentially related material to hit a token budget — risks diluting the activation signal. If a meaningful fraction of the profiling tokens come from topics that sit at the edge of the domain, the resulting activation histogram begins to resemble a general-purpose model rather than a specialist: the "hot" expert cluster spreads outward and the mask selection becomes less discriminating. The corpus assembled here was grown iteratively, with sub-corpus additions reviewed for domain relevance before inclusion. A more rigorous data-driven approach — one that measures the dispersion or entropy of the emerging activation cluster after each corpus increment and uses that as a stopping criterion — would provide principled feedback to arrest growth at the point of diminishing domain focus. This remains an area of future work.

### Step 2 — 3D Histogram Collection

The full parent model is run in forward-pass mode over the corpus with a hook attached to each MoE layer's router. For each token, the router selects the top-8 experts and assigns softmax weights. The hook accumulates a **3D histogram**:

```

histograms[layer, expert, rank] — integer count of selections

weight_sum[layer, expert, rank] — sum of router softmax weights

```

`rank` runs from 0 (highest-weight expert, primary selection) to 7 (lowest-weight, filler). Capturing per-rank information preserves the router's confidence signal: a rank-0 firing (the expert is the router's first choice) is qualitatively different from a rank-7 firing (the expert fills the last slot with low confidence).
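The accumulation step can be sketched in plain Python. This is an illustrative sketch, not the project's hook code: `new_tables` and `accumulate` are hypothetical helpers, and `router_probs` stands in for the per-expert softmax weights that a real forward hook would read from the router.

```python
TOP_K = 8  # routed experts selected per token


def new_tables(n_layers, n_experts, top_k=TOP_K):
    """Allocate zeroed [layer][expert][rank] tables for counts and weight sums."""
    hist = [[[0] * top_k for _ in range(n_experts)] for _ in range(n_layers)]
    wsum = [[[0.0] * top_k for _ in range(n_experts)] for _ in range(n_layers)]
    return hist, wsum


def accumulate(hist, wsum, layer, router_probs, top_k=TOP_K):
    """Update both tables with one token's router output at one layer."""
    # Rank experts by router weight, highest first; rank 0 is the first choice.
    ranked = sorted(range(len(router_probs)), key=lambda e: router_probs[e], reverse=True)
    for rank, expert in enumerate(ranked[:top_k]):
        hist[layer][expert][rank] += 1
        wsum[layer][expert][rank] += router_probs[expert]


# Toy usage: 1 layer, 4 experts, top-2 routing, two tokens (made-up weights).
hist, wsum = new_tables(n_layers=1, n_experts=4, top_k=2)
accumulate(hist, wsum, 0, [0.1, 0.6, 0.05, 0.25], top_k=2)
accumulate(hist, wsum, 0, [0.5, 0.3, 0.1, 0.1], top_k=2)
print(hist[0])  # per-expert counts at rank 0 and rank 1
```

In the toy run, expert 1 is picked first for one token and second for the other, so it accumulates one count at each rank.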

### Step 3 — Utility Scoring

Each (layer, expert) pair receives a scalar utility score:

$$\text{util}[l, e] = \sum_{k=0}^{7} \frac{\text{histograms}[l,e,k]}{N_\text{tokens}} \times \frac{\text{weight\_sum}[l,e,k]}{\max(\text{histograms}[l,e,k], 1)}$$

This is the frequency-weighted mean router confidence — how often the expert is selected, weighted by how much the router trusts it when it does fire. Experts that fire rarely but at high confidence (niche specialists) score proportionally higher than experts that fire frequently at marginal confidence (generalists).

**Expert indices are local to each layer** — expert N in layer 0 and expert N in layer 15 are completely independent entities with no shared weights. All selection and ranking operations are performed per-layer.
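The utility formula translates directly into code. The sketch below scores a single (layer, expert) pair from its per-rank counts and weight sums; the function name and the toy numbers are illustrative and not taken from the profiling run.

```python
def utility(hist_le, wsum_le, n_tokens):
    """Utility of one (layer, expert) pair from per-rank counts and weight sums."""
    total = 0.0
    for count, weight in zip(hist_le, wsum_le):
        frequency = count / n_tokens          # how often the expert is picked at this rank
        mean_weight = weight / max(count, 1)  # mean router weight when it is picked
        total += frequency * mean_weight
    return total


# Toy example: 2 ranks, 10 profiled tokens (made-up numbers).
frequent_marginal = utility([0, 8], [0.0, 0.8], n_tokens=10)  # 8 hits at weight 0.1
rare_confident = utility([2, 0], [1.2, 0.0], n_tokens=10)     # 2 hits at weight 0.6
print(frequent_marginal, rare_confident)
```

Whenever a count is non-zero, the product of frequency and mean weight reduces to `weight_sum / n_tokens`, so the score is the total router weight an expert receives per profiled token.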

### Step 4 — Three-Pass Mask Construction

**Pass 1 — Domain baseline:** Select the top-64 experts per layer by utility score. This captures the most domain-activated experts in the standard activation sense.

**Pass 2 — Structural whitelist enforcement:** Experts with an average activation rank below 2 and at least 10 activations are whitelisted regardless of their overall utility ranking. Any whitelisted expert not already selected in Pass 1 is swapped into the mask, replacing an included expert that has both a weak average rank and low utility. Applied to the physics, chemistry, math, and engineering specialists.

**Pass 3 — CoT/reasoning arbitrage:** Experts that activate strongly on domain-agnostic logic and reasoning chain-of-thought traces are swapped into the mask in the same manner as the whitelist experts. This installs reasoning capacity that is not reliably captured by domain-text profiling alone. Swaps are capped at 6 per layer, with the additional requirement that the swap candidate have higher utility than the expert it replaces; this yields ≈160–175 total swaps across 30 layers.
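The first two passes can be sketched for a single layer as below. This is a simplified illustration under stated assumptions: the helper names are hypothetical, and the eviction rule (drop the lowest-utility member that is not itself a candidate) is a stand-in for the rank-and-utility rule described above. The `max_swaps` argument mirrors the per-layer cap used in Pass 3.

```python
def top_k_mask(util_layer, k):
    """Pass 1: indices of the k highest-utility experts in one layer."""
    order = sorted(range(len(util_layer)), key=lambda e: util_layer[e], reverse=True)
    return set(order[:k])


def whitelist(hist_layer, max_mean_rank=2.0, min_count=10):
    """Pass 2 candidates: mean activation rank below 2 and at least 10 activations."""
    chosen = set()
    for expert, counts in enumerate(hist_layer):
        total = sum(counts)
        if total < min_count:
            continue
        mean_rank = sum(rank * c for rank, c in enumerate(counts)) / total
        if mean_rank < max_mean_rank:
            chosen.add(expert)
    return chosen


def swap_in(mask, candidates, util_layer, max_swaps=None):
    """Swap candidates into the mask while keeping its size fixed."""
    mask = set(mask)
    swaps = 0
    for cand in sorted(candidates - mask):
        if max_swaps is not None and swaps >= max_swaps:
            break
        evictable = [e for e in mask if e not in candidates]
        if not evictable:
            break
        # Assumption: evict the lowest-utility member that is not a candidate.
        mask.remove(min(evictable, key=lambda e: util_layer[e]))
        mask.add(cand)
        swaps += 1
    return mask


# Toy layer: 6 experts, 3 ranks, keep k = 3 (the real setting is 128 experts, k = 64).
util = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2]  # made-up utilities
hist = [[0, 0, 40], [1, 0, 0], [0, 0, 20], [0, 0, 30], [12, 0, 0], [0, 2, 0]]

base = top_k_mask(util, k=3)
final = swap_in(base, whitelist(hist), util)
print(sorted(base), sorted(final))
```

In the toy layer, expert 4 misses the utility cut but is always chosen at rank 0, so the whitelist pass swaps it in for the weakest Pass 1 member.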

### Step 5 — GGUF Surgery

The mask JSON specifies which 64 of 128 experts to retain per layer. The surgery script reads the parent GGUF, removes the `ffn_gate`, `ffn_up`, and `ffn_down` weight tensors for all non-mask experts, and writes the result as a new, smaller GGUF. Tensor norms are verified post-surgery; any NaN or Inf aborts the process. Attention layers, embedding layers, and the shared expert are untouched.

---

## Models in This Release

### Math Specialist

**HF:** `JThomas-CoE/coe-gemma4-math-hc-14b-a4b-q4` | **Ollama:** `coe-gemma4-math-14b-a4b:q4` (Q4_K_M)

Profiling corpus: standard and advanced mathematics textbooks, olympiad-level problem sets, supplementary coverage including number theory, combinatorics, and geometry.

**MATH-500 benchmark** (think_off):

| Category | N | FULL% |
|---|---|---|
| Algebra | 124 | 99.2% |
| Number Theory | 62 | **100.0%** |
| Prealgebra | 82 | 97.6% |
| Precalculus | 56 | 98.2% |
| Counting & Probability | 38 | 94.7% |
| Intermediate Algebra | 97 | 93.8% |
| Geometry | 41 | 85.4% |
| **Overall** | **500** | **96.4%** |

**AIME 2026** (30 problems, post-training-cutoff, think_off): **19/30 (63.3%)** *(rebench 2026-05-17 — 32K context window, loop-retry pass; see note)*

> **AIME rebench note (2026-05-17).** The initial AIME run used an 8K context window. Gemma4-26B requires approximately 24K tokens for a complete AIME chain-of-thought; truncation was silently cutting solutions mid-derivation and causing the model to loop until the token budget was exhausted without producing a final answer, a confounding factor that suppressed all three models. The rebench used a 32K context window with a loop-retry pass (up to 2 attempts per problem). Scores across all three models improved substantially: HC 53.3% (16/30) → **63.3% (19/30)**, parent Q4 50.0% (15/30) → **73.3% (22/30)**, MMLU-Pro 56.7% (17/30) → **76.7% (23/30)**. All AIME figures in this README reflect the 32K rebench.

**MMLU-Pro Math** (1,351 questions, 10-choice MCQ, think_off):

| Metric | Math specialist (Q4) | Parent (Q4) | Δ vs parent |
|---|---|---|---|
| pass@1 | **92.9%** | 93.3% | −0.4 pp |
| pass@5 | **95.6%** | 96.5% | −0.9 pp |

Selected leaderboard comparison (MMLU-Pro Math, TIGER-Lab):

| Model | Score |
|---|---|
| **coe-gemma4-math-14b-a4b-q4 pass@5** | **≈95.6%** |
| Gemini-3.1-Pro | 95.5% |
| Llama-3.1-Nemotron-70B | 95.5% |
| **coe-gemma4-math-14b-a4b-q4 pass@1** | **92.9%** |
| GPT-oss-20B | 91.5% |
| Gemini-2.5-Pro-Exp | 88.8% |
| DeepSeek-V3 (671B) | 86.2% |

*Leaderboard scores sourced from TIGER-Lab MMLU-Pro; individual submitters may report pass@1 or pass@5 — verify methodology before direct comparison.*

At pass@1, the math specialist (12.3 GB VRAM at 16k ctx, 13.4 GB at 64k ctx) beats DeepSeek-V3 (671B total parameters) and Gemini-2.5-Pro-Exp on this benchmark without an explicit chain-of-thought scratchpad. Its −0.4 pp deficit against the parent Q4 at pass@1 is within the noise floor for n=1,351.
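The noise-floor statement can be checked with a binomial standard error. The sketch below assumes the specialist and parent runs are independent samples, which is a simplification because both answer the same 1,351 questions; the function name is illustrative.

```python
import math


def binomial_se(p, n):
    """1-sigma standard error of an accuracy p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


n = 1351
specialist, parent = 0.929, 0.933  # pass@1 figures reported above

se_specialist = binomial_se(specialist, n)
se_parent = binomial_se(parent, n)
# Standard error of the difference, assuming the two runs are independent.
se_diff = math.hypot(se_specialist, se_parent)

gap_pp = 100 * (specialist - parent)
print(f"gap = {gap_pp:+.1f} pp, 1-sigma of the gap = {100 * se_diff:.2f} pp")
```

Under these assumptions the 1σ uncertainty on the gap is about 1 pp, so a 0.4 pp deficit cannot be distinguished from sampling noise.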

**Memory note:** 9 experts activate per token at both K=64 and K=128; the pool restriction saves ≈7.1 GB of VRAM at Q4 (measured, approximately context-independent) with identical throughput and zero post-surgery training.

**Related model:** An MMLU-Pro–derived math specialist (`JThomas-CoE/coe-gemma4-math-mmlu_pro-14b-a4b-q4`) produced via the automated pipeline is also available. See [HF_README_coe-gemma4-mmlu_pro_batch.md](https://huggingface.co/JThomas-CoE) for the automated-pipeline methodology and comparison. The hand-curated (this) model achieves 96.4% on MATH-500 and **63.3%** on AIME 2026 (32K ctx rebench). The MMLU-Pro variant scores **96.0%** on MATH-500 (480/500 FULL, think_off) and **76.7%** on AIME 2026 (23/30), effectively matching this model on MATH-500 and exceeding it by +13.4 pp on AIME. With n=30 problems, each AIME score carries a 1σ binomial error of roughly ±8 to 9 pp, so the standard error of the difference is ≈11.7 pp (treating the two runs as independent); the AIME gap is therefore about 1.1σ and not statistically significant.

---

## Near/Far Transfer Benchmark — Measuring Semantic Localization

One of the core questions in the College of Experts project is whether activation-directed expert surgery produces genuine **semantic localization** — specialist models whose expert populations are structurally organized around knowledge domains, not just quantization artifacts.

To test this, we ran a 20-pair cross-domain transfer efficiency experiment. Each pair sends one specialist model to answer questions from a different domain's benchmark. "NEAR" pairs are semantically adjacent (physics → math), "FAR" pairs are semantically distant (law → math). If expert populations are semantically organized, NEAR visitors should transfer well and FAR visitors should collapse.

### Transfer Efficiency Metric

Raw accuracy is confounded by domain difficulty. We normalize by the parent model's accuracy on the same questions:

$$Y = \frac{\text{acc}_\text{visitor on target}}{\text{par}_\text{target}}$$

Y = 1.0 means the specialist recovers 100% of what the unmodified parent achieves on the target domain. Y > 1 means it exceeds the parent. This removes the domain difficulty confound: a score of 70% on law (hard for every model) and 70% on math (easy for every model) are not equivalent — normalizing by the parent baseline makes them comparable.
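As a worked example, the sketch below computes Y and its 1σ error for one pair. Two points are assumptions of this sketch rather than statements from the benchmark code: the error formula scales the visitor's binomial standard error by the parent accuracy (treating the parent baseline as fixed), and the correct-answer counts for pair 01 are back-calculated from the reported Acc and Y rather than taken from raw logs.

```python
import math


def transfer_efficiency(visitor_correct, parent_correct, n):
    """Return (Y, sigma_Y) for one visitor/target pair."""
    acc = visitor_correct / n
    par = parent_correct / n
    y = acc / par
    # Binomial standard error of the visitor accuracy, divided by the parent accuracy.
    sigma_acc = math.sqrt(acc * (1.0 - acc) / n)
    return y, sigma_acc / par


# Pair 01 (physics -> math, n = 135); counts inferred from Acc = 88.9% and Y = 0.945.
y, sigma_y = transfer_efficiency(visitor_correct=120, parent_correct=127, n=135)
print(round(y, 3), round(sigma_y, 3))
```

With these inferred counts the sketch reproduces the pair 01 entries in the results table (Y = 0.945, σ_Y = 0.029).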

### Structural Similarity Predictors

Two measures of how similar two specialists' expert populations are:

- **F-score (binary):** fraction of experts in the visitor's mask that also appear in the target's mask after subtraction of experts found in all masks (set overlap)

- **Agg-cos (budget-weighted cosine):** cosine similarity between the two models' per-layer activation utility vectors, weighted by expert budget

Both are computed from the activation histograms at zero inference cost.
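Both predictors can be sketched in a few lines. The definitions above leave some details open, so this is one possible reading rather than the project's implementation: masks are modelled as sets of `(layer, expert)` pairs, the F-score denominator is the visitor's mask after removing experts common to all masks, and Agg-cos is taken as a budget-weighted mean of per-layer cosines. All function names are hypothetical.

```python
import math


def f_score(visitor_mask, target_mask, common):
    """Share of the visitor's non-common experts that the target also retains."""
    visitor = visitor_mask - common
    target = target_mask - common
    return len(visitor & target) / len(visitor) if visitor else 0.0


def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def agg_cos(util_a, util_b, budgets):
    """Budget-weighted mean of per-layer cosine similarities."""
    weighted = sum(w * cosine(a, b) for a, b, w in zip(util_a, util_b, budgets))
    return weighted / sum(budgets)


# Toy data: one layer for the masks, two layers for the utility vectors.
visitor = {(0, 0), (0, 1), (0, 2)}
target = {(0, 0), (0, 2), (0, 3)}
common = {(0, 0)}  # experts present in every specialist mask
overlap = f_score(visitor, target, common)

similarity = agg_cos([[1, 0], [1, 1]], [[1, 0], [1, 0]], budgets=[1, 1])
print(overlap, similarity)
```

With a fixed budget of 64 experts in every layer the weights are uniform, and Agg-cos under this reading reduces to the plain mean of the per-layer cosines.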

### Full 20-Pair Results

All benchmarks use OOD off5 split (40% of questions held out from profiling), think_off mode.

| # | Arm | Visitor | Target | n | Acc | Y=Acc/Par | σ\_Y | F-score | Agg-cos |
|---|---|---|---|---|---|---|---|---|---|
| 01 | NEAR | physics | math | 135 | 88.9% | 0.945 | 0.029 | 0.787 | 0.764 |
| 02 | FAR | law | math | 135 | 18.5% | 0.197 | 0.036 | 0.485 | 0.440 |
| 03 | NEAR | chemistry | physics | 130 | 83.1% | 0.952 | 0.038 | 0.855 | 0.842 |
| 04 | FAR | law | physics | 130 | 10.0% | 0.115 | 0.030 | 0.448 | 0.396 |
| 05 | NEAR | physics | engineering | 97 | 63.9% | 0.880 | 0.067 | 0.854 | 0.841 |
| 06 | FAR | law | engineering | 97 | 12.4% | 0.170 | 0.046 | 0.421 | 0.367 |
| 07 | NEAR | math | cs | 41 | 90.2% | **1.089** | 0.056 | 0.723 | 0.700 |
| 08 | FAR | law | cs | 41 | 51.2% | 0.618 | 0.094 | 0.485 | 0.437 |
| 09 | NEAR | physics | chemistry | 113 | 74.3% | 0.844 | 0.047 | 0.855 | 0.842 |
| 10 | FAR | law | chemistry | 113 | 13.3% | 0.151 | 0.036 | 0.413 | 0.363 |
| 11 | NEAR | psychology | biology | 72 | 75.0% | 0.834 | 0.057 | 0.749 | 0.721 |
| 12 | FAR | business | biology | 72 | 54.2% | 0.603 | 0.065 | 0.576 | 0.534 |
| 13 | NEAR | psychology | economics | 84 | 65.5% | 0.745 | 0.059 | 0.739 | 0.713 |
| 14 | FAR | chemistry | economics | 84 | 65.5% | 0.745 | 0.059 | 0.571 | 0.535 |
| 15 | NEAR | math | business | 79 | 77.2% | 0.884 | 0.054 | 0.756 | 0.732 |
| 16 | FAR | law | business | 79 | 13.9% | 0.159 | 0.045 | 0.541 | 0.499 |
| 17 | NEAR | biology | psychology | 80 | 65.0% | 0.782 | 0.064 | 0.749 | 0.721 |
| 18 | FAR | engineering | psychology | 80 | 62.5% | 0.752 | 0.065 | 0.486 | 0.436 |
| 19 | NEAR | economics | law | 110 | 45.5% | 0.730 | 0.076 | 0.701 | 0.676 |
| 20 | FAR | chemistry | law | 110 | 33.6% | 0.540 | 0.072 | 0.413 | 0.363 |

### Key Observations

- **No FAR pair outperforms its NEAR counterpart.** NEAR wins nine of the ten matchups and ties the tenth (pairs 13/14, both Y = 0.745), so the near/far gap never reverses. This is the primary finding: semantic proximity to the specialist's profiling domain predicts transfer efficiency.

- **Math → CS (pair 07, Y = 1.089):** The math specialist *exceeds* the parent on CS questions — the only super-parity point. CS in MMLU-Pro is heavily algorithmic; mathematical reasoning transfers directly and adds capacity the parent's generic routing doesn't prioritize.

- **Math → business (pair 15, Y = 0.884):** The math specialist transfers competently to business questions, outperforming the law FAR pair (Y = 0.159) by a wide margin. Business reasoning in MMLU-Pro has a quantitative component accessible to the math expert population.

- **Law collapses on the math, physics, engineering, and chemistry targets** (Y = 0.115–0.197). CS is the exception among the technical targets (pair 08, Y = 0.618). This is the strongest evidence that expert populations are genuinely semantically organized.

### Correlation with Structural Predictors

Pearson r against transfer efficiency Y (n = 20):

| Predictor | r |
|---|---|
| F-score (expert set overlap) | 0.800 |
| Agg-cos (budget-weighted cosine) | 0.791 |
| Layer variance of alignment (Std/Agg) | −0.746 |
| Raw specialist home accuracy (confounded) | 0.858* |

*Confounded by domain difficulty; collapses to r ≈ 0.615 after normalizing by parent baseline.
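The F-score correlation can be recomputed from the published results. The sketch below copies the F-score and Y columns for pairs 01–20 from the 20-pair results table and applies the textbook Pearson formula; it is an independent check and not the analysis script used for this release.

```python
import math

# F-score and Y columns for pairs 01-20, copied from the results table.
F_SCORE = [0.787, 0.485, 0.855, 0.448, 0.854, 0.421, 0.723, 0.485, 0.855, 0.413,
           0.749, 0.576, 0.739, 0.571, 0.756, 0.541, 0.749, 0.486, 0.701, 0.413]
Y = [0.945, 0.197, 0.952, 0.115, 0.880, 0.170, 1.089, 0.618, 0.844, 0.151,
     0.834, 0.603, 0.745, 0.745, 0.884, 0.159, 0.782, 0.752, 0.730, 0.540]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(F_SCORE, Y)
print(f"Pearson r (F-score vs Y) = {r:.3f}")
```

The result should land close to the 0.800 reported for the F-score predictor; swapping in the Agg-cos column allows the same check for the second row.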

**Best multi-predictor model** (Agg-cos + layer-heterogeneity + normalized specialist quality):

R² = 0.796, adj-R² = 0.757, p = 1.28 × 10⁻⁷

### Scatter Plots

Two-panel scatter plot: transfer efficiency (Y) vs F-score (left, r = 0.800) and vs OLS composite fitted value (right, r = 0.892).

![Transfer Efficiency Scatter](plot_transfer_efficiency_g39.png)

NEAR pairs (blue circles) cluster in the top-right of both panels; FAR pairs (red squares) cluster bottom-left. Error bars are ±1σ (binomial standard error propagated through the parent normalization).

---

## Architectural Background — College of Experts

These models are the surgical specialist components of a broader **College of Experts (CoE)** architecture. The full system routes queries to the appropriate specialist at the task level (not the token level), each specialist serving its domain at reduced VRAM cost. A lightweight supervisor model analyzes the incoming query and dispatches to the most capable specialist.

The key insight enabling this approach: Gemma4-26B-A4B's MoE routing is **semantically organized**. Experts that activate strongly on physics problems do not activate strongly on legal reasoning, and vice versa. This is not imposed by training; it emerges from the model's internal organization. The activation profiling and near/far transfer experiments in this release are the empirical demonstration of that structure. The nature of MoE architectures made this likely, and we previously found qualitative evidence for it in other MoE models (Qwen family); this work supports it quantitatively and in a different model family.

By fixing K=64 (50% of the expert pool) and selecting the 64 experts most relevant to each domain, we produce a model that:

1. Activates the same number of parameters per forward pass as the original architecture (9 experts fire per token either way)

2. Reduces total memory residency by ≈35–37% at Q4, eliminating the need to keep 64 cold experts per layer in memory

3. Concentrates the active compute on experts that are actually useful for the task

The practical result is near SOTA-competitive performance at consumer-grade hardware footprints. This math specialist beats DeepSeek-V3 (671B parameters) on MMLU-Pro Math while taking just 10 GB of disk space, completely runnable on a mid-tier 16 GB GPU.

---

## Known Limitations

1. **Think-on mode is not reliable.** See the ⚠️ section above. Do not use think_on without a substantially larger token budget (≥ 32,768 num_predict) and expect degraded performance relative to think_off even then.

2. **Geometry is a weak spot.** Geometry scored 85.4% vs 93–100% for other categories on MATH-500. This is the lowest-performing category and the only one with meaningful headroom relative to peer categories. A v2 with dedicated geometry corpus coverage is planned.

3. **Arithmetic precision.** All models produce small floating-point arithmetic errors on hard exponentiation (e.g. Re^0.8, fractional power of 5-digit numbers). The formula and reasoning are almost always correct; the mental arithmetic at the final step is not. A calculation backend (Python tool call, `math.pow` / `math.exp`) eliminates this class of error entirely and is strongly recommended; however a compatible runtime backend is required. Ollama integration for tool calls is currently problematic and left to future work.

4. **No retrieval augmentation out of the box.** The models are able to receive knowledge base context via the `<context>` block pattern. Building a retrieval pipeline around a domain-specific corpus is the highest-leverage deployment improvement available for applied problem solving.

5. **Loop instability at high temperature.** T ≥ 0.8 materially increases loop rates in think_off mode. Use T = 0.6 for all deployments.

6. **Vision capability — out of scope.** The Gemma4-26B-A4B parent model is natively multimodal (text + image input). All activation profiling in this release was conducted exclusively on text corpora. No effort was made to specifically retain or specifically remove the vision-processing expert population; vision experts were subject to the same 50% selection as all others, with the mask driven entirely by text-domain activation signal. Preliminary testing suggests a nominally coherent remnant of vision capability survives surgery, but fidelity relative to the parent has not been characterized. Vision input should not be treated as a benchmarked or supported capability. Systematic profiling of vision-domain experts and deliberate preservation of the vision pathway is left to future work.
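For the arithmetic-precision limitation, the calculation backend can be as small as a single function that the surrounding application calls with the model's operands. The sketch below is a hypothetical tool body only; `power_tool` is an illustrative name, and wiring it into a tool-calling runtime is not shown.

```python
import math


def power_tool(base, exponent):
    """Hypothetical calculator tool: evaluate base ** exponent in floating point."""
    if base < 0 and not float(exponent).is_integer():
        raise ValueError("negative base with fractional exponent is not real-valued")
    return math.pow(base, exponent)


# Example: a Re^0.8 term with a made-up 5-digit Reynolds number.
reynolds = 48_250.0
print(power_tool(reynolds, 0.8))
```

The model would still choose the formula and set up the expression; only the final numeric evaluation is delegated.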

---

## Citation / Attribution

Research and engineering by JThomas-CoE.

- **Project repository:** [College-of-Experts-AI](https://github.com/JThomas-CoE/College-of-Experts-AI) — code, tooling, and methodology documentation

- **Gemma4 methodology:** [gemma4/README.md](https://github.com/JThomas-CoE/College-of-Experts-AI/blob/main/gemma4/README.md)

- **Whitepaper:** [WHITEPAPER.md](https://github.com/JThomas-CoE/College-of-Experts-AI/blob/main/WHITEPAPER.md) — theoretical basis for expert specialization

- **Preprint:** [Separability of Intelligence](https://github.com/JThomas-CoE/College-of-Experts-AI/blob/main/qwen3.5/PREPRINT.md) — empirical evidence from the prior MoE specialist series

- **Model hub:** [huggingface.co/JThomas-CoE](https://huggingface.co/JThomas-CoE)

Base model: Gemma 4 26B-A4B-IT by Google. All specialist weights are derived from the publicly released checkpoint. Usage is subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).

---

## License

Model weights: subject to the Gemma license (see above).

Code and tooling: PolyForm Noncommercial 1.0.0

Commercial licensing: see [LICENSE-COMMERCIAL.md](https://github.com/JThomas-CoE/College-of-Experts-AI/blob/main/LICENSE-COMMERCIAL.md)