Gemma 4 models are designed to deliver frontier-level performance at each size. They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

Details

Updated 15 hours ago

ec3dfd1a22e5 · 7.2GB

Layer	Contents	Size
model	arch gemma4 · parameters 5.12B · quantization Q4_K_M	7.2GB
params	{ "stop": [ "<turn|>" ] }	31B

Gemma 4 on Hugging Face | Heretic | Unsloth
License: Gemma | Base authors: Google DeepMind

gemma4-E2B-it-qat-Q4-unsloth-heretic

This is a decensored (“abliterated”) build of Google’s Gemma 4 E2B instruction-tuned model. The refusal behavior has been removed so the model answers prompts it would otherwise decline, while its general capabilities are kept intact.

Base model: unsloth/gemma-4-E2B-it-qat-q4_0-unquantized
Decensoring tool: Heretic (Arbitrary-Rank Ablation, row-norm preservation)
Formats: bf16 safetensors + a Q4_K_M GGUF for llama.cpp / Ollama

ollama run MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic

[!Warning] Abliteration removes the model’s built-in safety refusals. You are solely responsible for what you generate with it. Use must still comply with the Gemma Terms of Use.

What “unsloth” means here

Unsloth is an open-source project that makes LLM fine-tuning and inference faster and far more memory-efficient. Beyond the training library, Unsloth re-publishes upstream model weights in convenient, ready-to-use forms — fixed chat templates, dynamic GGUF quants, and clean safetensors mirrors.

The base used here, unsloth/gemma-4-E2B-it-qat-q4_0-unquantized, is Unsloth’s mirror of Google’s QAT (Quantization-Aware Training) checkpoint, with the Q4_0 QAT weights restored to full precision (bf16 safetensors). QAT lets a model keep close to bf16 quality after 4-bit quantization, so this “unquantized-from-QAT” checkpoint is the ideal, quantization-friendly starting point for downstream work — which is exactly why it was chosen as the abliteration base.
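To make "4-bit quantization" concrete, here is a minimal sketch of block-wise 4-bit rounding in the style of Q4_0: one scale per block of 32 weights and 16 integer levels. It is an illustration only, not llama.cpp's code; the function names and the random test block are assumptions introduced here. QAT trains the model so that its weights tolerate this kind of rounding.

```python
import random

BLOCK_SIZE = 32  # assumption: Q4_0-style blocks of 32 weights sharing one scale


def quantize_block(weights):
    """Map one block of floats to 4-bit codes (0..15) plus a single float scale."""
    # The largest-magnitude element (sign kept) is mapped exactly onto code 0.
    extreme = max(weights, key=abs)
    scale = extreme / -8.0
    if scale == 0.0:
        return 0.0, [8] * len(weights)  # an all-zero block decodes to zeros
    codes = [min(15, max(0, int(w / scale + 8.5))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats: each code c stands for scale * (c - 8)."""
    return [scale * (c - 8) for c in codes]


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]  # made-up weights
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f}  max round-trip error={max_error:.6f}")
```

The round-trip error is bounded by the block's scale, which is why a checkpoint trained to sit near these 16 levels loses little when it is quantized again.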

What “heretic” means here

Heretic is an automated decensoring (abliteration) tool. “Abliteration” identifies the internal direction a model uses to represent refusal and removes that direction from the weights that write to the residual stream, so the model stops refusing while its other behavior stays largely intact. Heretic automates the search: it runs a TPE (Tree-structured Parzen Estimator) optimization over the abliteration parameters, scoring each candidate on how many refusals remain versus how far the model’s output distribution drifts from the original (KL divergence), and returns the Pareto-optimal trade-offs.
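The two-objective selection described above can be sketched as follows. This is a hedged illustration of the idea, not Heretic's code: `kl_divergence` and `pareto_front` are hypothetical helpers, and every trial in the list is invented except the (11 refusals, KL 0.0489) pair reported for this build.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def pareto_front(trials):
    """Keep trials that no other trial beats on both refusals and KL divergence."""
    front = []
    for t in trials:
        dominated = any(
            o["refusals"] <= t["refusals"] and o["kl"] <= t["kl"]
            and (o["refusals"] < t["refusals"] or o["kl"] < t["kl"])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return sorted(front, key=lambda t: t["refusals"])


# Made-up trial results (refusals out of 100 prompts, KL vs. the original model).
trials = [
    {"refusals": 97, "kl": 0.0},
    {"refusals": 40, "kl": 0.012},
    {"refusals": 11, "kl": 0.0489},  # the pair reported for this build
    {"refusals": 15, "kl": 0.09},    # worse than the previous trial on both axes
    {"refusals": 3, "kl": 0.31},
    {"refusals": 45, "kl": 0.02},    # worse than the 40-refusal trial on both axes
]

front = pareto_front(trials)
for t in front:
    print(t["refusals"], t["kl"])
```

Every trial on the front is a defensible choice; picking one means deciding how much drift from the original model is acceptable for a given drop in refusals.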

This build used the Arbitrary-Rank Ablation (ARA) method with row-norm preservation: row_normalization = full, applied through a rank-3 LoRA adapter that renormalizes weight rows to their original magnitudes in order to limit collateral damage.
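As a rough illustration of the two operations involved (removing a direction from a weight matrix, then restoring row magnitudes), here is a rank-1 sketch in plain Python. It is a deliberate simplification and not Heretic's implementation: ARA, as its name suggests, is not limited to a single direction, and all helper names and numbers below are assumptions. The weight matrix is laid out as (output dimension, input dimension), so each row writes to one residual-stream coordinate.

```python
import math


def l2(v):
    """L2 norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-length difference of mean activations (one common estimate)."""
    dim = len(harmful_acts[0])
    mean_bad = [sum(a[i] for a in harmful_acts) / len(harmful_acts) for i in range(dim)]
    mean_ok = [sum(a[i] for a in harmless_acts) / len(harmless_acts) for i in range(dim)]
    diff = [b - o for b, o in zip(mean_bad, mean_ok)]
    n = l2(diff)
    return [x / n for x in diff]


def ablate(weight, direction):
    """Return W - r (r^T W): outputs of the new matrix have no component along r."""
    rows, cols = len(weight), len(weight[0])
    proj = [sum(direction[i] * weight[i][j] for i in range(rows)) for j in range(cols)]
    return [[weight[i][j] - direction[i] * proj[j] for j in range(cols)] for i in range(rows)]


def restore_row_norms(original, edited):
    """Rescale each edited row to the L2 norm of the matching original row."""
    out = []
    for row_o, row_e in zip(original, edited):
        norm_e = l2(row_e)
        factor = l2(row_o) / norm_e if norm_e > 0.0 else 1.0
        out.append([x * factor for x in row_e])
    return out


# Made-up activations (3-dimensional residual stream) and a made-up 3x2 weight matrix.
harmful = [[1.0, 1.0, 0.0], [0.8, 0.8, 0.2]]
harmless = [[0.0, 0.0, 0.0], [0.2, 0.2, 0.2]]
W = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]

r = refusal_direction(harmful, harmless)
ablated = ablate(W, r)
final = restore_row_norms(W, ablated)
print([round(l2(row), 4) for row in W])
print([round(l2(row), 4) for row in final])
```

In this naive form the rescaling step can reintroduce some component along `r`, so the sketch only shows what each operation does in isolation, not how a real tool balances them.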

Abliteration details

Setting	Value
Tool	Heretic v1.4.0
Method	Arbitrary-Rank Ablation (ARA), `row_normalization = full`, LoRA rank 3
Scope	Language-model output projections (`o_proj`, `down_proj`); vision/audio towers untouched
Trials	200 (TPE) over `mlabonne/harmless_alpaca` + `mlabonne/harmful_behaviors`
Selected trial	refusals 11/100, KL divergence 0.0489

Heretic v1.2.0 was the originally intended version, but it predates the gemma4 architecture in 🤗 Transformers and cannot load it; v1.4.0 implements the identical ARA + row-norm-preservation method.

About Gemma 4 (base model)

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. Gemma 4 features a large context window and maintains multilingual support in over 140 languages.

The E2B variant is the smallest Gemma 4 model, designed for efficient on-device and edge deployment. The “E” stands for effective parameters: the model uses Per-Layer Embeddings (PLE) so its effective parameter count is much smaller than its total.

Property	E2B
Total Parameters	2.3B effective (5.1B with embeddings)
Layers	35
Sliding Window	512 tokens
Context Length	128K tokens
Vocabulary Size	262K
Supported Modalities	Text, Image, Audio

Gemma 4 uses a hybrid attention mechanism that interleaves local sliding-window attention with full global attention (the final layer is always global). Global layers use unified Keys and Values and apply Proportional RoPE (p-RoPE) to optimize memory for long contexts.
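The difference between the two layer types comes down to which earlier tokens each position may attend to. The sketch below builds both masks for a toy sequence; the function names are hypothetical, and the exact window convention (whether the current token counts toward the window) is an assumption rather than something taken from the Gemma 4 implementation.

```python
def causal_mask(seq_len):
    """Global attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Local attention: token i sees only the last `window` tokens, itself included."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def cached_positions(context_len, window=None):
    """Keys/values a layer must keep: the whole context, or at most one window."""
    return context_len if window is None else min(context_len, window)


global_mask = causal_mask(6)
local_mask = sliding_window_mask(6, 3)
for row in local_mask:
    print("".join("x" if allowed else "." for allowed in row))

# With the 512-token window from the table, a local layer's cache stops growing
# once the context exceeds the window, while a global layer's keeps growing.
print(cached_positions(100_000, window=512), cached_positions(100_000))
```

This is why interleaving local layers reduces memory at long context: only the global layers pay for the full sequence.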

Core capabilities

Thinking – built-in step-by-step reasoning mode.
Long context – up to 128K tokens on E2B.
Image understanding – detection, document/PDF parsing, UI/chart understanding, OCR, handwriting.
Audio – speech recognition (ASR) and speech-to-translated-text.
Function calling – native structured tool use for agentic workflows.
Coding – generation, completion, and correction.
Multilingual – out-of-the-box for many languages, pre-trained on 140+.

E2B benchmark reference (base, instruction-tuned)

Benchmark	Gemma 4 E2B
MMLU Pro	60.0%
GPQA Diamond	43.4%
LiveCodeBench v6	44.0%
MMMU Pro (vision)	44.2%
MATH-Vision	52.4%

Reported for the unmodified base model; abliteration may shift safety-adjacent behavior.

Getting started (Transformers)

from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic"

processor = AutoProcessor.from_pretrained("unsloth/gemma-4-E2B-it-qat-q4_0-unquantized")  # processor config is reused from the base model
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype="auto", device_map="auto")  # decensored weights

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]
inputs = processor.apply_chat_template(
    messages, tokenize=True, return_dict=True, return_tensors="pt",
    add_generation_prompt=True, enable_thinking=False,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]  # prompt length in tokens
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0][input_len:], skip_special_tokens=False))  # decode only the newly generated tokens

The safetensors here are the full multimodal Gemma 4 weights with decensored text layers; the processor/preprocessor config is reused from the base model. The provided GGUF is text-only (vision/audio projectors are not included).

Recommended sampling

temperature=1.0, top_p=0.95, top_k=64. Thinking is enabled by including <|think|> at the start of the system prompt; remove it to disable. Many libraries (Transformers, llama.cpp) handle the chat template for you.
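For the Ollama build, the same settings can be passed per request through the `options` field of Ollama's `/api/chat` endpoint. The sketch below is a minimal example using only the standard library. The helper names are hypothetical, the URL assumes a default local Ollama install, and whether this build's Ollama chat template passes a leading `<|think|>` through unchanged is an assumption.

```python
import json
import urllib.request

MODEL = "MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic"
OLLAMA_URL = "http://localhost:11434/api/chat"  # assumption: default local address


def build_chat_request(user_text, system_text="You are a helpful assistant.", thinking=False):
    """Build the JSON body for /api/chat with the recommended sampling settings."""
    if thinking:
        # Thinking is toggled by a token at the start of the system prompt.
        system_text = "<|think|>" + system_text
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64},
    }


def chat(body, url=OLLAMA_URL):
    """POST the body to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]


body = build_chat_request("Write a short joke about saving RAM.")
print(json.dumps(body["options"]))
```

With the server running and the model pulled, `chat(body)` returns the reply as a string; nothing is sent until that function is called.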

Limitations

This model inherits the limitations of the Gemma 4 base (factual accuracy, common-sense gaps, sensitivity to prompt quality, training-data biases) and additionally has its safety refusals removed. It will attempt to comply with prompts the original model would refuse, including harmful ones. Apply your own content-safety safeguards for any deployment.