ollama run MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic
Gemma 4 on Hugging Face | Heretic | Unsloth
License: Gemma | Base authors: Google DeepMind
This is a decensored (“abliterated”) build of Google’s Gemma 4 E2B instruction-tuned model. The refusal behavior has been removed so the model answers prompts it would otherwise decline, while its general capabilities are kept intact.
| Item | Value |
|---|---|
| Base model | unsloth/gemma-4-E2B-it-qat-q4_0-unquantized |
| Formats | safetensors + a Q4_K_M GGUF for llama.cpp / Ollama |
| Run with Ollama | ollama run MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic |
> [!WARNING]
> Abliteration removes the model’s built-in safety refusals. You are solely responsible for what you generate with it. Your use must still comply with the Gemma Terms of Use.
Unsloth is an open-source project that makes LLM fine-tuning and
inference faster and far more memory-efficient. Beyond the training library, Unsloth re-publishes
upstream model weights in convenient, ready-to-use forms — fixed chat templates, dynamic GGUF quants,
and clean safetensors mirrors.
The base used here, unsloth/gemma-4-E2B-it-qat-q4_0-unquantized, is Unsloth’s mirror of Google’s
QAT (Quantization-Aware Training) checkpoint, with the Q4_0 QAT weights restored to full precision
(bf16 safetensors). QAT lets a model keep close to bf16 quality after 4-bit quantization, so this
“unquantized-from-QAT” checkpoint is the ideal, quantization-friendly starting point for downstream work
— which is exactly why it was chosen as the abliteration base.
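To make the 4-bit round trip concrete, here is a minimal pure-Python sketch of symmetric block-wise 4-bit quantization. It is a simplified illustration, not the exact Q4_0 storage layout used by llama.cpp; the block size of 32 and all helper names are assumptions made for this example. QAT trains the model so that the rounding error measured at the end costs little quality.

```python
import math

BLOCK_SIZE = 32  # assumed block size for this illustration


def quantize_block(values):
    """Quantize one block of floats to signed 4-bit codes in [-8, 7] plus one scale."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0.0, [0] * len(values)
    scale = max_abs / 7.0  # map the largest magnitude onto the code 7
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Map 4-bit codes back to floats using the block's scale."""
    return [scale * c for c in codes]


def roundtrip(weights, block_size=BLOCK_SIZE):
    """Quantize and immediately dequantize a flat list of weights, block by block."""
    restored = []
    for start in range(0, len(weights), block_size):
        scale, codes = quantize_block(weights[start:start + block_size])
        restored.extend(dequantize_block(scale, codes))
    return restored


# Toy weights: the per-value error is bounded by half a quantization step.
weights = [math.sin(i) for i in range(64)]
restored = roundtrip(weights)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max round-trip error: {max_error:.4f}")
```

Each block stores one scale and one small integer per weight, which is where the memory saving comes from.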
Heretic is an automated decensoring (abliteration) tool. “Abliteration” identifies the internal direction a model uses to represent refusal and removes that direction from the weights that write to the residual stream, so the model stops refusing while its other behavior is largely preserved. Heretic automates this: it runs a Tree-structured Parzen Estimator (TPE) optimization over the abliteration parameters, scoring each candidate on how many refusals remain versus how far the model’s output distribution drifts from the original (KL divergence), and returns the Pareto-optimal trade-offs.
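The core idea can be sketched in a few lines. This is the textbook rank-1 form of directional ablation, not Heretic's ARA implementation: the direction is estimated as a difference of mean activations, and the weight matrix is assumed to have shape (d_model, d_in) so that its rows index residual-stream dimensions. The toy activations and all function names are hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (harmful minus harmless)."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def ablate(weight, direction):
    """Return (I - r r^T) W, so that W can no longer write along r."""
    d_model, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction.
    proj = [sum(direction[i] * weight[i][j] for i in range(d_model)) for j in range(d_in)]
    return [[weight[i][j] - direction[i] * proj[j] for j in range(d_in)]
            for i in range(d_model)]


# Toy residual-stream activations (d_model = 3), purely illustrative.
harmful = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
harmless = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
r = refusal_direction(harmful, harmless)

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # shape (d_model=3, d_in=2)
W_ablated = ablate(W, r)

# After ablation, the component of every output along r is zero.
residual = [sum(r[i] * W_ablated[i][j] for i in range(3)) for j in range(2)]
print(r, W_ablated, residual)
```

Only the component along the estimated direction is removed; everything orthogonal to it is left unchanged, which is why the rest of the model's behavior tends to survive.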
This build used the Arbitrary-Rank Ablation (ARA) method with row-norm preservation:
row_normalization = full, implemented as a rank-3 LoRA adapter that renormalizes weight rows to
their original magnitudes in order to limit collateral damage.
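As a rough illustration of what row-norm preservation means, the sketch below rescales each row of a modified weight matrix back to the L2 norm of the corresponding original row. This is a dense, simplified version under our own assumptions; Heretic expresses the change as a low-rank LoRA adapter, and its exact procedure may differ. The function names and the toy matrices are hypothetical.

```python
import math


def row_norms(matrix):
    """L2 norm of every row."""
    return [math.sqrt(sum(v * v for v in row)) for row in matrix]


def restore_row_norms(modified, original):
    """Rescale each row of `modified` to the L2 norm of the matching row in `original`."""
    result = []
    for new_row, target in zip(modified, row_norms(original)):
        current = math.sqrt(sum(v * v for v in new_row))
        factor = target / current if current > 0.0 else 0.0
        result.append([v * factor for v in new_row])
    return result


original = [[3.0, 4.0], [0.0, 2.0]]   # row norms: 5 and 2
modified = [[3.0, 0.0], [0.0, 1.0]]   # e.g. after an ablation step shrank the rows
renormalized = restore_row_norms(modified, original)
print(row_norms(renormalized))
```

Rescaling keeps each row's direction from the modified matrix while restoring its original magnitude, so the overall scale of the layer's outputs stays closer to what downstream layers expect.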
| Setting | Value |
|---|---|
| Tool | Heretic v1.4.0 |
| Method | Arbitrary-Rank Ablation (ARA), row_normalization = full, LoRA rank 3 |
| Scope | Language-model output projections (o_proj, down_proj); vision/audio towers untouched |
| Trials | 200 (TPE) over mlabonne/harmless_alpaca + mlabonne/harmful_behaviors |
| Selected trial | refusals 11/100, KL divergence 0.0489 |
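The two scores in the selected trial pull in opposite directions, so trials are compared on a Pareto front rather than by a single number. The sketch below shows one way to compute KL divergence between two discrete distributions and to filter non-dominated trials. Apart from the selected trial (11 refusals, KL 0.0489), the trial values are invented for illustration, and the function names are not taken from Heretic.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def pareto_front(trials):
    """Keep trials not dominated on (refusals, kl); lower is better for both."""
    front = []
    for t in trials:
        dominated = any(
            o["refusals"] <= t["refusals"] and o["kl"] <= t["kl"]
            and (o["refusals"] < t["refusals"] or o["kl"] < t["kl"])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return sorted(front, key=lambda t: t["refusals"])


# Hypothetical trials; only the (11, 0.0489) entry comes from the table above.
trials = [
    {"refusals": 97, "kl": 0.0},
    {"refusals": 11, "kl": 0.0489},
    {"refusals": 30, "kl": 0.02},
    {"refusals": 40, "kl": 0.06},   # dominated: worse than (30, 0.02) on both axes
    {"refusals": 5, "kl": 0.31},
]
front = pareto_front(trials)
print([(t["refusals"], t["kl"]) for t in front])
```

Every point on the front is a defensible choice; picking one is a judgment about how much drift is acceptable for a given reduction in refusals.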
Heretic v1.2.0 was the originally intended version, but it predates the gemma4 architecture in 🤗 Transformers and cannot load it; v1.4.0 implements the identical ARA + row-norm-preservation method.
Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. Gemma 4 features a large context window and maintains multilingual support in over 140 languages.
The E2B variant is the smallest Gemma 4 model, designed for efficient on-device and edge deployment. The “E” stands for effective parameters: the model uses Per-Layer Embeddings (PLE) so its effective parameter count is much smaller than its total.
| Property | E2B |
|---|---|
| Total Parameters | 2.3B effective (5.1B with embeddings) |
| Layers | 35 |
| Sliding Window | 512 tokens |
| Context Length | 128K tokens |
| Vocabulary Size | 262K |
| Supported Modalities | Text, Image, Audio |
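The parameter counts in the table translate into rough weight-storage figures with simple arithmetic. The sketch below assumes 1 GB = 1e9 bytes and a flat number of bits per weight; real GGUF files such as Q4_K_M average somewhat more than 4 bits per weight and carry extra metadata, so treat these as lower-bound estimates.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB for a given parameter count and bit width."""
    return params_billions * bits_per_weight / 8


EFFECTIVE_B = 2.3   # effective parameters, in billions (from the table)
TOTAL_B = 5.1       # including embeddings, in billions (from the table)

embedding_b = TOTAL_B - EFFECTIVE_B          # parameters attributed to embeddings
bf16_gb = weight_memory_gb(TOTAL_B, 16)      # full-precision safetensors
int4_gb = weight_memory_gb(TOTAL_B, 4)       # idealized flat 4-bit storage

print(f"embeddings: {embedding_b:.1f}B, bf16: {bf16_gb:.2f} GB, 4-bit: {int4_gb:.2f} GB")
```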
Gemma 4 uses a hybrid attention mechanism that interleaves local sliding-window attention with full global attention (the final layer is always global). Global layers use unified Keys and Values and apply Proportional RoPE (p-RoPE) to optimize memory for long contexts.
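The difference between the two layer types comes down to the attention mask. The following sketch builds a global causal mask and a sliding-window causal mask as boolean matrices. It assumes the window counts the current token, which is one common convention and may not match Gemma 4's exact definition; a window of 3 stands in for the real 512.

```python
def causal_mask(seq_len):
    """Global attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Local attention: token i attends only to the last `window` tokens, itself included."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


global_mask = causal_mask(6)
local_mask = sliding_window_mask(6, window=3)

for row in local_mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a local mask, each token's key/value footprint is capped at the window size regardless of sequence length, which is why interleaving local layers with a few global ones reduces memory for long contexts.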
| Benchmark | Gemma 4 E2B |
|---|---|
| MMLU Pro | 60.0% |
| GPQA Diamond | 43.4% |
| LiveCodeBench v6 | 44.0% |
| MMMU Pro (vision) | 44.2% |
| MATH-Vision | 52.4% |
Reported for the unmodified base model; abliteration may shift safety-adjacent behavior.
```python
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic"

# The processor/preprocessor config is reused from the base model.
processor = AutoProcessor.from_pretrained("unsloth/gemma-4-E2B-it-qat-q4_0-unquantized")
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

# Render the chat template and tokenize in one step; thinking is disabled here.
inputs = processor.apply_chat_template(
    messages, tokenize=True, return_dict=True, return_tensors="pt",
    add_generation_prompt=True, enable_thinking=False,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, keeping special tokens visible.
print(processor.decode(outputs[0][input_len:], skip_special_tokens=False))
```
The safetensors here are the full multimodal Gemma 4 weights with decensored text layers; the processor/preprocessor config is reused from the base model. The provided GGUF is text-only (vision/audio projectors are not included).
Sampling settings: temperature=1.0, top_p=0.95, top_k=64. Thinking is enabled by including <|think|>
at the start of the system prompt; remove it to disable. Many libraries (Transformers, llama.cpp)
apply the chat template for you.
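As a sketch of how these settings might be wired together, the helpers below build a message list with optional thinking and a request body in the shape accepted by Ollama's `/api/chat` endpoint (`model`, `messages`, `options`, `stream`). The helper names are our own, and whether the Ollama template for this build honors the `<|think|>` prefix exactly as written is an assumption to verify.

```python
import json

MODEL = "MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic"
THINK_TOKEN = "<|think|>"
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def build_messages(system_prompt, user_prompt, thinking=False):
    """Prefix the system prompt with the think token when thinking is requested."""
    system = f"{THINK_TOKEN}{system_prompt}" if thinking else system_prompt
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]


def build_ollama_request(messages, model=MODEL):
    """Assemble a non-streaming chat request body with the sampling settings above."""
    return {"model": model, "messages": messages, "options": dict(SAMPLING), "stream": False}


request = build_ollama_request(
    build_messages("You are a helpful assistant.", "Write a short joke about saving RAM.",
                   thinking=True)
)
print(json.dumps(request, indent=2))
```

The resulting JSON can be sent as the POST body to a local Ollama server (by default `http://localhost:11434/api/chat`).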
This model inherits the limitations of the Gemma 4 base (factual accuracy, common-sense gaps, sensitivity to prompt quality, training-data biases) and additionally has its safety refusals removed. It will attempt to comply with prompts the original model would refuse, including harmful ones. Apply your own content-safety safeguards for any deployment.
Decensored with Heretic. Base weights © Google DeepMind, mirrored by Unsloth, used under the Gemma Terms.