# Gemma4 College of Experts — Python Specialist

**Base model:** [google/gemma-4-26b-it](https://huggingface.co/google/gemma-4-26b-it)

**Architecture:** MoE — 26B total / ≈4B active parameters (1 shared expert + 8 routed from a pool of 128 per MoE layer, 30 MoE layers)

**Method:** Activation-directed expert surgery — 128 → 64 experts per layer (50% reduction)

**Quantizations:** Q4_K_M (≈9.7 GB on disk) · Q8_0 (≈18.4 GB on disk)

**HF repos:** `JThomas-CoE/coe-gemma4-python-hc-14b-a4b-q4` · `JThomas-CoE/coe-gemma4-python-hc-14b-a4b-q8`

**Ollama tags:** `coe-gemma4-python-14b-a4b:q4` · `coe-gemma4-python-14b-a4b:q8`

**Hub:** [JThomas-CoE on HuggingFace](https://huggingface.co/JThomas-CoE)

> **HC (Hand-Curated)** — The activation profiling corpus used to build this model's expert mask was assembled by hand, selecting high-quality domain-representative text from textbooks, problem sets, and technical references. This contrasts with the MMLU-Pro variants (`-mmlu_pro-` in the repo name) which were profiled on stratified subsets of the TIGER-Lab/MMLU-Pro benchmark dataset.

---

## ⚠️ Beta Release — Safety Disclaimer

**These models are beta releases and should be treated as research artifacts, not production-ready systems.**

Expert surgery selects and retains domain-relevant experts based on activation patterns observed during profiling. The pruning pipeline is designed solely to create a coherent domain specialist — it has no mechanism to identify which experts contribute to model alignment, ethical reasoning, or safety guardrails. As a result, experts responsible for enforcing those behaviours may have been inadvertently removed during the surgery process.

**Appropriate use of any model in the College of Experts family is the sole responsibility of the end user.** The authors make no representation that these models retain the safety properties of the parent `google/gemma-4-26b-it` model, and users should not rely on them as a substitute for models that have undergone safety evaluation.

---

## ⚠️ Critical Usage Note — Think-Off Mode

**All models in this series must be used in thinking-off mode.**

If you are using the Ollama API, pass `"think": false` in your request body. If you are accessing the model via a raw API (llama.cpp server, OpenAI-compatible endpoint, etc.) you **must inject a closed thinking block** at the start of the assistant turn:

```python
messages = [
    {"role": "system", "content": "Your system prompt here."},
    {"role": "user", "content": "Your question here."},
    {"role": "assistant", "content": "<think></think>\n"},  # <-- required prefill
]
```
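For the Ollama path, a request body with thinking disabled might be assembled as in the sketch below. The `"think": false` field and the temperature come from this README; the URL is Ollama's default local address, and the helper names `build_chat_payload` and `send_chat` are illustrative assumptions rather than part of this project's tooling.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local Ollama address


def build_chat_payload(model: str, system: str, user: str) -> dict:
    """Build an Ollama chat request body with thinking disabled."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "think": False,  # serialised to JSON as "think": false
        "stream": False,
        "options": {"temperature": 0.6},
    }


def send_chat(payload: dict, url: str = OLLAMA_CHAT_URL) -> dict:
    """POST the payload to a running Ollama server and return the parsed reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload(
    "coe-gemma4-python-14b-a4b:q4",
    "You are a Python coding specialist.",
    "Write a function that reverses a linked list.",
)
print(json.dumps(payload, indent=2))
```

`send_chat(payload)` is only defined here, not called, so the snippet runs without a server; call it once Ollama is serving the model.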

**Why this is required:** Expert surgery retains 50% of the expert pool per layer, selecting experts that are maximally active on domain content and chain-of-thought reasoning. A side effect is that the loop-suppression experts — which activate on metacognitive closure signals near the end of a `<think>` block — do not have a concentrated domain-specific activation signature and are disproportionately pruned. In think-on mode, this causes the model to enter a reasoning loop that exhausts the token budget without producing a final answer. In extreme cases, the loop rate is 60–70% on hard questions.

The `<think></think>` prefill works by consuming the opening `<think>` token before generation starts, so the model sees its thinking as already complete and proceeds directly to answering. This is the mechanism used in all benchmarks reported here.
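For raw endpoints, the prefill can be applied by a small wrapper so that callers do not have to remember it. This is a sketch only: `with_closed_think_block` is a hypothetical helper name, and it assumes the endpoint continues generation from a trailing assistant message.

```python
THINK_PREFILL = {"role": "assistant", "content": "<think></think>\n"}


def with_closed_think_block(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` that ends with the closed-think prefill."""
    result = [dict(message) for message in messages]  # do not mutate the caller's list
    if result and result[-1] == THINK_PREFILL:
        return result  # prefill already present
    result.append(dict(THINK_PREFILL))
    return result


conversation = [
    {"role": "system", "content": "You are a Python coding specialist."},
    {"role": "user", "content": "Explain list comprehensions."},
]
prefilled = with_closed_think_block(conversation)
print(prefilled[-1])
```

The wrapper is idempotent, so applying it twice does not stack a second prefill.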

**What think-off mode does not disable:** Gemma4's chain-of-thought training is deeply ingrained. Even with the think block closed, the model produces brief inline reasoning interleaved with its answer — shorter and more linear than a full scratchpad, but present. All benchmark figures in this README are measured in this constrained-implicit-CoT mode, which is more conservative than the full explicit CoT used by leaderboard entries.

### Ollama Modelfile Template

```
FROM <model_path_or_ollama_tag>

PARAMETER temperature 0.6
PARAMETER repeat_penalty 1.05
PARAMETER num_ctx 8192
PARAMETER num_predict 16384
PARAMETER think false

SYSTEM """
You are a Python coding specialist.
Write clean, idiomatic, PEP-8 compliant Python.
When solving an algorithmic problem, reason step-by-step, then output complete runnable code.
Prefer stdlib solutions; use third-party libraries only when they are clearly necessary.
Return your complete answer, then stop with no further output.
"""
```

Temperature 0.6 is strongly recommended. Higher temperatures (≥ 0.8) materially increase loop rates in think-off mode.

---

## Q4 vs Q8 — Which Variant Should You Use?

Both variants use the **same pruned expert mask** — the same 64 experts are retained per layer in both cases. The difference is quantization precision for the retained weights.

**Q4_K_M** loads ≈12.3 GB VRAM at 16k context (≈7.1 GB less than the Q4 parent). It is the right choice for:

- Routine completions, single-function generation, code review

- Users on 16 GB VRAM cards

- High-throughput multi-turn sessions

**Q8_0** loads ≈21.0 GB VRAM at 16k context. It is the right choice for:

- Harder, trickier single problems where answer quality matters more than speed

- Multi-step algorithmic challenges, competitive programming problems

- Contexts where the extra ≈8.7 GB of VRAM (21.0 GB vs 12.3 GB) is available
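The guidance above can be reduced to a simple lookup. The sketch below uses the 16k-context VRAM figures quoted in this section; the function name `pick_variant` and the 1 GB `headroom_gb` safety margin are assumptions made for illustration, not measured recommendations.

```python
# VRAM at 16k context, as quoted in this section.
VRAM_AT_16K_GB = {"Q4_K_M": 12.3, "Q8_0": 21.0}


def pick_variant(free_vram_gb: float, prefer_quality: bool = True,
                 headroom_gb: float = 1.0):
    """Return the variant that fits in `free_vram_gb`, or None if neither fits."""
    fits = [name for name, need in VRAM_AT_16K_GB.items()
            if need + headroom_gb <= free_vram_gb]
    if not fits:
        return None
    if prefer_quality and "Q8_0" in fits:
        return "Q8_0"
    return "Q4_K_M"  # the smaller variant fits whenever anything fits


for card_gb in (8.0, 16.0, 24.0):
    print(card_gb, pick_variant(card_gb))
```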

### Empirical Q4 vs Q8 Coding Quality Comparison (LiveCodeBench)

| Metric | Q4_K_M | Q8_0 |
|---|---|---|
| Overall (50 problems) | 78.0% (39/50) | 88.0% (44/50) |
| Easy | 96% | 92% |
| Medium | 60% | **84%** |

On easy problems, Q4 is actually marginally higher (the difference is noise). On medium problems, Q8 gains a significant 24 pp. The Q8 advantage is concentrated on problems that require precise multi-step reasoning — exactly where weight quantization noise accumulates most.

*(Note: The Q4 coding and Q8 coding figures above come from the coding specialist model and are cited here as a reference illustration of the Q4/Q8 tradeoff. The Python specialist benchmarks below use the HumanEval-Python suite.)*

---

## What Are These Models?

These models are produced by **activation-directed expert surgery** applied to the Gemma4 26B-A4B instruction-tuned model. The surgery does not change any retained weight values: it prunes the FFN weight tensors for experts that are not part of the domain-specialist mask, then saves the result as a smaller GGUF. At 4-bit quantization the resulting model loads approximately 7 GB less VRAM than the parent (7.1 GB measured) while maintaining the same token throughput, because the active parameters per forward pass are unchanged: 9 experts (1 shared + 8 routed) fire per token regardless of pool size.

### Memory Efficiency

| Model | VRAM (16k context) | VRAM (second measurement) | Active params |
|---|---|---|---|
| Gemma4-26B parent (Q4_K_M) | 19.4 GB | 20.5 GB | ≈4B |
| Gemma4-26B parent (Q8_0) | 37.5 GB | 38.6 GB | ≈4B |
| Specialist K=64 (Q4_K_M) | **12.3 GB** | **13.4 GB** | ≈4B |
| Specialist K=64 (Q8_0) | **21.0 GB** | **22.1 GB** | ≈4B |
| Q4 savings vs Q4 parent | **7.1 GB (37%)** | **7.1 GB (35%)** | unchanged |
| Q8 savings vs Q8 parent | **16.5 GB (44%)** | **16.5 GB (43%)** | unchanged |

All figures directly measured in Ollama.
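The savings rows follow directly from the measured figures. As a quick check, the arithmetic can be reproduced as below; the dictionary layout and the `savings` helper are illustrative, and the numbers are copied from the table.

```python
# (parent GB, specialist GB) pairs for the two VRAM columns of the table above.
MEASURED = {
    "Q4_K_M": [(19.4, 12.3), (20.5, 13.4)],
    "Q8_0": [(37.5, 21.0), (38.6, 22.1)],
}


def savings(parent_gb: float, specialist_gb: float) -> tuple[float, int]:
    """Return (GB saved, whole-percent saved relative to the parent)."""
    saved = parent_gb - specialist_gb
    return round(saved, 1), round(100 * saved / parent_gb)


for quant, pairs in MEASURED.items():
    for parent_gb, specialist_gb in pairs:
        saved_gb, saved_pct = savings(parent_gb, specialist_gb)
        print(f"{quant}: {parent_gb} -> {specialist_gb} GB, "
              f"saves {saved_gb} GB ({saved_pct}%)")
```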

Throughput (tokens/second) is identical between the specialist and the parent at the same quantization because the number of expert weight tensors that participate in each forward pass is the same. The saving is purely in VRAM residency — half the expert weight tensors simply do not need to be loaded.

---

## Benchmark — HumanEval Python

**Evaluation harness:** `run_humaneval.py`, using the `grade` field ("PASS"/"FAIL") returned by the evaluation framework.

**Mode:** think_off (`<think></think>` prefill), temperature 0.6, stop tokens enforced.

**Model under test:** Python HC specialist Q4_K_M (`gemma4-python-K64-q4_K_M`).

| Benchmark | Passed | Total | Pass rate |
|---|---|---|---|
| HumanEval Python | 93 | 100 | **93.0%** |

The parent `google/gemma-4-26b-it` achieves approximately 87–90% on HumanEval in standard evaluations (GGUF Q4, think_off). The Python HC specialist achieves 93.0% under the same conditions, reflecting a +3–6 pp improvement from domain specialization.
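The headline figure is a plain tally of the `grade` field. The sketch below shows that tally on synthetic records; the record layout (a dict with `task_id` and `grade`) is an assumption, since the output format of `run_humaneval.py` is not documented here.

```python
def pass_rate(results: list[dict]) -> float:
    """Percentage of records whose `grade` field equals "PASS"."""
    if not results:
        raise ValueError("no results to score")
    passed = sum(1 for record in results if record["grade"] == "PASS")
    return 100.0 * passed / len(results)


# Synthetic records shaped like the reported outcome: 93 PASS out of 100.
records = [{"task_id": i, "grade": "PASS" if i < 93 else "FAIL"}
           for i in range(100)]
rate = pass_rate(records)
print(f"{rate:.1f}%")
print(f"+{rate - 90:.0f} to +{rate - 87:.0f} pp over the 87-90% parent range")
```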

---

## Activation Profiling — How the Masks Are Built

### Step 1 — Corpus Assembly

The Python corpus combines:

- **Python language textbook prose** — idiomatic Python, standard library deep-dives, Python data model internals

- **Algorithmic problem Q&A pairs** — sorted arrays, dynamic programming, graph traversal, string manipulation, and numeric algorithms

- **PEP documentation and style guides** — PEP-8, PEP-3107, PEP-484 and related typing discussions

- **Code review and debugging traces** — traces involving stack-trace reading, variable introspection, and fix generation

Corpus size: approximately 892,000 tokens across the python.stdlib, python.scientific, python.data, python.async_web, and python.algorithms sub-corpora. Profiling was run on the full parent model with router hooks capturing per-token expert selections across all 30 MoE layers.

**Corpus size considerations.** Choosing how much material to include for activation profiling involves two competing pressures. On one side, a corpus that is too small or too narrow may fail to activate the full set of experts that are genuinely relevant to the domain: rare but important concepts may appear in too few tokens to accumulate statistically reliable activation counts, leaving their associated experts underweighted or excluded from the mask. On the other side, a corpus that grows too large — particularly if expansion is driven by including only tangentially related material to hit a token budget — risks diluting the activation signal. At ≈892k tokens the Python corpus is the largest in the CoE specialist family and the dilution risk is most live here: the breadth of sub-corpora (stdlib usage, scientific computing, data engineering, async web patterns, algorithms) was chosen to reflect the genuine breadth of Python practice, but each sub-corpus was capped and reviewed for domain relevance to avoid blurring the expert cluster toward general programming rather than Python-specific reasoning. A more rigorous data-driven approach — one that measures the dispersion or entropy of the emerging activation cluster after each corpus increment and uses that as a stopping criterion — would provide principled feedback to arrest growth at the point of diminishing domain focus. This remains an area of future work.
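To make the proposed stopping criterion concrete, one possible form is sketched below. It is not part of the current pipeline: the Shannon entropy of a layer's expert-selection counts stands in for "dispersion", and the `should_stop` rule with its `tolerance_bits` threshold is a hypothetical choice.

```python
import math


def selection_entropy_bits(counts: list[int]) -> float:
    """Shannon entropy (bits) of one layer's expert-selection distribution."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [count / total for count in counts if count > 0]
    return -sum(p * math.log2(p) for p in probs)


def should_stop(entropy_history: list[float], tolerance_bits: float = 0.05) -> bool:
    """Hypothetical rule: stop when the latest corpus increment spreads the
    activation distribution by more than `tolerance_bits`."""
    if len(entropy_history) < 2:
        return False
    return entropy_history[-1] - entropy_history[-2] > tolerance_bits


focused = selection_entropy_bits([90, 10, 0, 0])   # mass concentrated on few experts
diluted = selection_entropy_bits([25, 25, 25, 25])  # mass spread evenly
print(focused < diluted, should_stop([focused, diluted]))
```

A rising entropy after an increment would suggest the new material is activating experts outside the existing cluster, which is the dilution effect described above.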

### Step 2 — 3D Histogram Collection

The full parent model is run in forward-pass mode over the corpus with a hook attached to each MoE layer's router. For each token, the router selects the top-8 experts and assigns softmax weights. The hook accumulates a **3D histogram**:

```
histograms[layer, expert, rank]   # integer count of selections
weight_sum[layer, expert, rank]   # sum of router softmax weights
```

`rank` runs from 0 (highest-weight expert, primary selection) to 7 (lowest-weight, filler). Capturing per-rank information preserves the router's confidence signal: a rank-0 firing (the expert is the router's first choice) is qualitatively different from a rank-7 firing (the expert fills the last slot with low confidence).
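The accumulation step can be illustrated with plain Python lists. This is a toy sketch, not the project's hook: the sizes are shrunk, `record_token` is a hypothetical name, and `router_weights` is assumed to be one already-normalised weight per expert for a single token.

```python
NUM_LAYERS, NUM_EXPERTS, TOP_K = 2, 6, 3  # toy sizes; the real model uses 30, 128, 8


def new_table(fill):
    """Allocate a [layer][expert][rank] table filled with `fill`."""
    return [[[fill] * TOP_K for _ in range(NUM_EXPERTS)]
            for _ in range(NUM_LAYERS)]


histograms = new_table(0)
weight_sum = new_table(0.0)


def record_token(layer: int, router_weights: list[float]) -> None:
    """Accumulate one token's top-k routing decision for one layer."""
    ranked = sorted(range(NUM_EXPERTS),
                    key=lambda expert: router_weights[expert], reverse=True)
    for rank, expert in enumerate(ranked[:TOP_K]):
        histograms[layer][expert][rank] += 1
        weight_sum[layer][expert][rank] += router_weights[expert]


# Two synthetic tokens passing through layer 0.
record_token(0, [0.05, 0.40, 0.12, 0.30, 0.05, 0.08])
record_token(0, [0.50, 0.20, 0.05, 0.15, 0.05, 0.05])
print(histograms[0][1], weight_sum[0][1])
```

Expert 1 is the first choice for one token and the second choice for the other, so its rank-0 and rank-1 cells each receive one count.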

### Step 3 — Utility Scoring

Each (layer, expert) pair receives a scalar utility score:

$$\text{util}[l, e] = \sum_{k=0}^{7} \frac{\text{histograms}[l,e,k]}{N_\text{tokens}} \times \frac{\text{weight\_sum}[l,e,k]}{\max(\text{histograms}[l,e,k], 1)}$$

This is the frequency-weighted mean router confidence — how often the expert is selected, weighted by how much the router trusts it when it does fire. Experts that fire rarely but at high confidence (niche specialists) score proportionally higher than experts that fire frequently at marginal confidence (generalists).

**Expert indices are local to each layer** — expert N in layer 0 and expert N in layer 15 are completely independent entities with no shared weights. All selection and ranking operations are performed per-layer.
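A direct transcription of the utility formula for a single (layer, expert) pair might look like this. The function name and the toy numbers are illustrative; only the formula itself comes from the text above.

```python
def utility(hist_row: list[int], wsum_row: list[float], n_tokens: int) -> float:
    """Utility of one (layer, expert) pair from its per-rank counts and weight sums."""
    total = 0.0
    for count, weights in zip(hist_row, wsum_row):
        frequency = count / n_tokens           # how often the expert lands at this rank
        mean_weight = weights / max(count, 1)  # mean router weight at this rank
        total += frequency * mean_weight
    return total


# Toy example with 3 ranks and 100 profiled tokens.
hist_row = [10, 5, 0]
wsum_row = [4.0, 1.0, 0.0]
score = utility(hist_row, wsum_row, n_tokens=100)
print(round(score, 6))
```

For any rank with a non-zero count, the count cancels between the two factors, so that rank contributes `weight_sum / N_tokens`; in the toy example the score equals (4.0 + 1.0) / 100.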

### Step 4 — Three-Pass Mask Construction

**Pass 1 — Domain baseline:** Select the top-64 experts per layer by utility score. This captures the most domain-activated experts in the standard activation sense.

**Pass 2 — Structural whitelist enforcement:** Experts with an average activation rank below 2 and at least 10 activations are whitelisted regardless of their overall utility ranking. Any whitelisted expert not already selected in Pass 1 is swapped into the mask, displacing an included expert with low average rank and low utility. This ensures that high-confidence structural experts are never displaced by marginally scoring domain specialists.

**Pass 3 — CoT/reasoning arbitrage:** Experts that activate strongly on domain-agnostic logic/reasoning chain-of-thought traces are swapped into the mask. Applied at a cap of 3 swaps per layer; ≈85–95 total swaps across 30 layers.
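The three passes can be sketched for a single layer as follows. This is a simplified illustration under stated assumptions, not the project's script: the eviction rule is reduced to "lowest utility among non-whitelisted experts", Pass 3 only swaps when the incoming expert has a higher CoT score than the one it replaces, and `cot_util` is a hypothetical per-expert score from reasoning traces.

```python
def average_rank(hist_row: list[int]) -> float:
    """Mean rank at which an expert was selected (0 = router's first choice)."""
    total = sum(hist_row)
    if total == 0:
        return float("inf")
    return sum(rank * count for rank, count in enumerate(hist_row)) / total


def build_layer_mask(util, hist, cot_util, keep, max_cot_swaps=3):
    """Return the sorted expert indices retained for one layer."""
    experts = range(len(util))

    # Pass 1: top-`keep` experts by domain utility.
    mask = set(sorted(experts, key=lambda e: util[e], reverse=True)[:keep])

    # Pass 2: whitelist experts with average rank < 2 and >= 10 activations.
    whitelist = {e for e in experts
                 if sum(hist[e]) >= 10 and average_rank(hist[e]) < 2}
    protected = set(whitelist)
    for expert in sorted(whitelist - mask):
        evictable = mask - protected
        if not evictable:
            break
        mask.remove(min(evictable, key=lambda e: util[e]))
        mask.add(expert)

    # Pass 3: swap in at most `max_cot_swaps` strong reasoning experts.
    candidates = sorted(set(experts) - mask,
                        key=lambda e: cot_util[e], reverse=True)
    for expert in candidates[:max_cot_swaps]:
        evictable = mask - protected
        if not evictable:
            break
        victim = min(evictable, key=lambda e: util[e])
        if cot_util[expert] <= cot_util[victim]:
            break
        mask.remove(victim)
        mask.add(expert)
        protected.add(expert)
    return sorted(mask)


# Toy layer: 6 experts, 4 ranks, keep 3.
util = [0.9, 0.8, 0.7, 0.1, 0.05, 0.6]
hist = [[50, 10, 0, 0], [0, 5, 20, 30], [0, 0, 10, 40],
        [12, 0, 0, 0], [3, 0, 0, 0], [0, 2, 10, 20]]
cot_util = [0.1, 0.2, 0.1, 0.0, 0.9, 0.3]
print(build_layer_mask(util, hist, cot_util, keep=3, max_cot_swaps=1))
```

In the toy layer, expert 3 enters through the whitelist and expert 4 through the reasoning swap, while the mask size stays fixed at `keep`.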

### Step 5 — GGUF Surgery

The mask JSON specifies which 64 of 128 experts to retain per layer. The surgery script reads the parent GGUF, removes the `ffn_gate`, `ffn_up`, and `ffn_down` weights of all non-mask experts, and writes the result as a new, smaller GGUF. Tensor norms are verified post-surgery; any NaN or Inf aborts the process. Attention layers, embedding layers, and the shared expert are untouched.
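The per-layer operation can be pictured with nested lists standing in for the expert-stacked tensors. This sketch assumes non-mask experts are dropped from each stacked tensor, which is what the smaller output file implies; it does not read or write GGUF, and `prune_experts` and `verify_finite` are hypothetical names.

```python
import math


def prune_experts(stacked: list[list[float]], keep: list[int]) -> list[list[float]]:
    """Keep only the expert slices listed in `keep` (one inner list per expert)."""
    return [list(stacked[expert]) for expert in sorted(keep)]


def verify_finite(name: str, stacked: list[list[float]]) -> float:
    """Return the L2 norm of a tensor, aborting on any NaN or Inf."""
    flat = [value for expert in stacked for value in expert]
    if not all(math.isfinite(value) for value in flat):
        raise ValueError(f"{name}: NaN or Inf found after surgery")
    return math.sqrt(sum(value * value for value in flat))


# Toy layer: 4 experts with 2 weights each, standing in for 128 real experts.
layer = {
    "ffn_gate": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
    "ffn_up": [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]],
    "ffn_down": [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]],
}
mask = [0, 2]  # experts retained for this layer

pruned = {name: prune_experts(tensor, mask) for name, tensor in layer.items()}
norms = {name: verify_finite(name, tensor) for name, tensor in pruned.items()}
print({name: len(tensor) for name, tensor in pruned.items()}, norms["ffn_down"])
```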

---

## Citation / Attribution

Research and engineering by JThomas-CoE.

- **Project repository:** [College-of-Experts-AI](https://github.com/JThomas-CoE/College-of-Experts-AI) — code, tooling, and methodology documentation

- **Gemma4 methodology:** [gemma4/README.md](https://github.com/JThomas-CoE/College-of-Experts-AI/blob/main/gemma4/README.md)

- **Whitepaper:** [WHITEPAPER.md](https://github.com/JThomas-CoE/College-of-Experts-AI/blob/main/WHITEPAPER.md) — theoretical basis for expert specialization

- **Preprint:** [Separability of Intelligence](https://github.com/JThomas-CoE/College-of-Experts-AI/blob/main/qwen3.5/PREPRINT.md) — empirical evidence from the prior MoE specialist series

- **Model hub:** [huggingface.co/JThomas-CoE](https://huggingface.co/JThomas-CoE)

Base model: Gemma 4 26B-A4B-IT by Google. All specialist weights are derived from the publicly released checkpoint. Usage is subject to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).

---

## License

Model weights: subject to the Gemma license (see above).

Code and tooling: PolyForm Noncommercial 1.0.0

Commercial licensing: see [LICENSE-COMMERCIAL.md](https://github.com/JThomas-CoE/College-of-Experts-AI/blob/main/LICENSE-COMMERCIAL.md)