Gemma 4 models are designed to deliver frontier-level performance at each size. They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

Capabilities: vision, tools, thinking
ollama run MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic

Applications

Claude Code: ollama launch claude --model MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic
Codex App: ollama launch codex-app --model MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic
OpenClaw: ollama launch openclaw --model MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic
Hermes Agent: ollama launch hermes --model MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic
Codex: ollama launch codex --model MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic
OpenCode: ollama launch opencode --model MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic

Readme

Gemma 4 on Hugging Face | Heretic | Unsloth
License: Gemma | Base authors: Google DeepMind

gemma4-E2B-it-qat-Q4-unsloth-heretic

This is a decensored (“abliterated”) build of Google’s Gemma 4 E2B instruction-tuned model. The refusal behavior has been removed so the model answers prompts it would otherwise decline, while its general capabilities are kept intact.

ollama run MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic

Warning: Abliteration removes the model’s built-in safety refusals. You are solely responsible for what you generate with it. Use must still comply with the Gemma Terms of Use.


What “unsloth” means here

Unsloth is an open-source project that makes LLM fine-tuning and inference faster and far more memory-efficient. Beyond the training library, Unsloth re-publishes upstream model weights in convenient, ready-to-use forms — fixed chat templates, dynamic GGUF quants, and clean safetensors mirrors.

The base used here, unsloth/gemma-4-E2B-it-qat-q4_0-unquantized, is Unsloth’s mirror of Google’s QAT (Quantization-Aware Training) checkpoint, with the Q4_0 QAT weights restored to full precision (bf16 safetensors). QAT lets a model keep close to bf16 quality after 4-bit quantization, so this “unquantized-from-QAT” checkpoint is the ideal, quantization-friendly starting point for downstream work — which is exactly why it was chosen as the abliteration base.
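
To make the 4-bit rounding that QAT prepares a model for more concrete, here is a minimal sketch of symmetric 4-bit block quantization. It is a simplified, Q4_0-style scheme written for illustration: the block size of 32 and the scale rule are assumptions, and real Q4_0 storage differs in detail. It shows only the quantize/dequantize round trip, not QAT training itself.

```python
import numpy as np

BLOCK = 32  # assumed block size for this illustration

def quantize_q4_blocks(x: np.ndarray):
    """Symmetric 4-bit block quantization (simplified, Q4_0-style)."""
    blocks = x.astype(np.float32).reshape(-1, BLOCK)
    # One scale per block, chosen so the largest magnitude maps to level 7.
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / 7.0, 1.0)
    # Round to the nearest integer level and keep it in the signed 4-bit range.
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_q4_blocks(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map integer levels back to floats using the per-block scale."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4 * BLOCK).astype(np.float32)  # stand-in for real weights
q, scale = quantize_q4_blocks(w)
w_hat = dequantize_q4_blocks(q, scale)
print("max abs round-trip error:", float(np.abs(w - w_hat).max()))
```

In this scheme each weight moves by at most half a quantization step for its block. QAT exposes the model to this kind of rounding during training so that its quality degrades less once the weights are actually stored in 4 bits.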

What “heretic” means here

Heretic is an automated decensoring / abliteration tool. “Abliteration” identifies the internal direction a model uses to represent refusal and removes that direction from the weights that write to the residual stream, so the model stops refusing while keeping everything else. Heretic does this automatically: it runs a TPE optimization over the abliteration parameters, scoring each candidate on how many refusals remain versus how much the model’s output distribution drifts from the original (KL divergence), and returns the Pareto-optimal trade-offs.
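
The core idea of removing a direction from weights that write to the residual stream can be sketched in a few lines. This is the classic rank-1 form of directional ablation, not Heretic's code and not the ARA variant used for this build. The matrix layout (output dimension first) and the random "refusal direction" are assumptions for illustration; in practice the direction is estimated from model activations.

```python
import numpy as np

def ablate_direction(weight: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove `direction` from the output space of `weight`.

    weight:    (d_model, d_in) matrix whose output is added to the residual stream.
    direction: (d_model,) vector, e.g. an estimated refusal direction.
    """
    r = direction / np.linalg.norm(direction)
    # W' = (I - r r^T) W: subtract each column's component along r.
    return weight - np.outer(r, r @ weight)

rng = np.random.default_rng(0)
d_model, d_in = 8, 5
W = rng.normal(size=(d_model, d_in))
refusal_dir = rng.normal(size=d_model)  # hypothetical direction for the demo

W_ablated = ablate_direction(W, refusal_dir)
r_unit = refusal_dir / np.linalg.norm(refusal_dir)
# After ablation, no output of the layer has a component along r.
residual = float(np.abs(r_unit @ W_ablated).max())
print("largest remaining component along r:", residual)
```

Because the projection is applied to the weights themselves, the layer can no longer write along that direction, whatever its input.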

This build used the Arbitrary-Rank Ablation (ARA) method with row-norm preservation (row_normalization = full), applied as a rank-3 LoRA adapter that renormalizes weight rows to preserve their original magnitudes and so minimize collateral damage.
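
As a rough illustration of what preserving row magnitudes can mean, the sketch below rescales each row of a modified matrix back to the L2 norm of the corresponding original row. This is an assumption about the general idea only: the function name is hypothetical, the choice of which axis counts as a "row" is assumed, and Heretic's actual implementation may differ.

```python
import numpy as np

def restore_row_norms(original: np.ndarray, modified: np.ndarray,
                      eps: float = 1e-12) -> np.ndarray:
    """Rescale each row of `modified` to the L2 norm of the same row in `original`."""
    orig_norms = np.linalg.norm(original, axis=1, keepdims=True)
    new_norms = np.linalg.norm(modified, axis=1, keepdims=True)
    # eps guards against division by zero for an all-zero row.
    return modified * (orig_norms / np.maximum(new_norms, eps))

rng = np.random.default_rng(0)
W_orig = rng.normal(size=(6, 4))
# Stand-in for an ablated matrix: the original plus a small change.
W_mod = W_orig + 0.1 * rng.normal(size=(6, 4))

W_fixed = restore_row_norms(W_orig, W_mod)
print(np.linalg.norm(W_orig, axis=1))
print(np.linalg.norm(W_fixed, axis=1))  # same magnitudes as the original rows
```

Rescaling changes only the length of each row, not its direction, so the edit made by the ablation is kept while the overall weight scale stays where the original model had it.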

Abliteration details

| Setting | Value |
| --- | --- |
| Tool | Heretic v1.4.0 |
| Method | Arbitrary-Rank Ablation (ARA), row_normalization = full, LoRA rank 3 |
| Scope | Language-model output projections (o_proj, down_proj); vision/audio towers untouched |
| Trials | 200 (TPE) over mlabonne/harmless_alpaca + mlabonne/harmful_behaviors |
| Selected trial | refusals 11/100, KL divergence 0.0489 |
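
Choosing a trial means picking one point on the trade-off between remaining refusals and KL divergence. The sketch below shows how a set of Pareto-optimal trials can be extracted from candidate results. The `Trial` type and all numbers are invented for illustration and are not taken from this build's optimization run.

```python
from typing import NamedTuple

class Trial(NamedTuple):
    refusals: int  # refusals remaining on the evaluation prompts
    kl: float      # KL divergence from the original model

def pareto_front(trials: list[Trial]) -> list[Trial]:
    """Return trials not dominated on (refusals, kl); lower is better for both."""
    front = []
    for t in trials:
        dominated = any(
            o.refusals <= t.refusals and o.kl <= t.kl
            and (o.refusals < t.refusals or o.kl < t.kl)
            for o in trials
        )
        if not dominated:
            front.append(t)
    return sorted(front)

# Made-up candidate results.
trials = [
    Trial(60, 0.01), Trial(20, 0.05), Trial(20, 0.09),
    Trial(8, 0.30), Trial(65, 0.02),
]
front = pareto_front(trials)
for t in front:
    print(t)
```

Here `Trial(20, 0.09)` is dropped because `Trial(20, 0.05)` has the same refusal count with less drift, and `Trial(65, 0.02)` is dropped because `Trial(60, 0.01)` is better on both axes. Picking among the remaining points is a judgment call about how much drift is acceptable.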

Heretic v1.2.0 was the originally intended version, but it predates the gemma4 architecture in 🤗 Transformers and cannot load it; v1.4.0 implements the identical ARA + row-norm-preservation method.


About Gemma 4 (base model)

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. Gemma 4 features a large context window and maintains multilingual support in over 140 languages.

The E2B variant is the smallest Gemma 4 model, designed for efficient on-device and edge deployment. The “E” stands for effective parameters: the model uses Per-Layer Embeddings (PLE) so its effective parameter count is much smaller than its total.

| Property | E2B |
| --- | --- |
| Total Parameters | 2.3B effective (5.1B with embeddings) |
| Layers | 35 |
| Sliding Window | 512 tokens |
| Context Length | 128K tokens |
| Vocabulary Size | 262K |
| Supported Modalities | Text, Image, Audio |

Gemma 4 uses a hybrid attention mechanism that interleaves local sliding-window attention with full global attention (the final layer is always global). Global layers use unified Keys and Values and apply Proportional RoPE (p-RoPE) to optimize memory for long contexts.

Core capabilities

  • Thinking – built-in step-by-step reasoning mode.
  • Long context – up to 128K tokens on E2B.
  • Image understanding – detection, document/PDF parsing, UI/chart understanding, OCR, handwriting.
  • Audio – speech recognition (ASR) and speech-to-translated-text.
  • Function calling – native structured tool use for agentic workflows.
  • Coding – generation, completion, and correction.
  • Multilingual – out-of-the-box for many languages, pre-trained on 140+.

E2B benchmark reference (base, instruction-tuned)

| Benchmark | Gemma 4 E2B |
| --- | --- |
| MMLU Pro | 60.0% |
| GPQA Diamond | 43.4% |
| LiveCodeBench v6 | 44.0% |
| MMMU Pro (vision) | 44.2% |
| MATH-Vision | 52.4% |

Reported for the unmodified base model; abliteration may shift safety-adjacent behavior.

Getting started (Transformers)

from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic"

# The processor/preprocessor config is reused from the base model.
processor = AutoProcessor.from_pretrained("unsloth/gemma-4-E2B-it-qat-q4_0-unquantized")
# Load the decensored weights; dtype and device placement are chosen automatically.
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]
# Render the chat template and tokenize in one step, with thinking disabled.
inputs = processor.apply_chat_template(
    messages, tokenize=True, return_dict=True, return_tensors="pt",
    add_generation_prompt=True, enable_thinking=False,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens; special tokens are left visible.
print(processor.decode(outputs[0][input_len:], skip_special_tokens=False))

The safetensors here are the full multimodal Gemma 4 weights with decensored text layers; the processor/preprocessor config is reused from the base model. The provided GGUF is text-only (vision/audio projectors are not included).

Recommended sampling

Use temperature=1.0, top_p=0.95, top_k=64. Thinking is enabled by including <|think|> at the start of the system prompt; remove it to disable. Many libraries (Transformers, llama.cpp) handle the chat template for you.
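
The settings above can be collected into a small helper, sketched here under a few assumptions. `build_messages` and `SAMPLING` are hypothetical names, the token is prepended with no separator, and the sampling values are expressed as keyword arguments for Transformers' `generate()`. If your library's chat template already toggles thinking for you, prefer that over editing the system prompt by hand.

```python
THINK_TOKEN = "<|think|>"

def build_messages(user_prompt: str,
                   system_prompt: str = "You are a helpful assistant.",
                   thinking: bool = False) -> list[dict]:
    """Build a chat message list, optionally enabling thinking mode."""
    if thinking:
        # Thinking is switched on by placing the token at the start of the system prompt.
        system_prompt = THINK_TOKEN + system_prompt
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

# Recommended sampling settings as generate() keyword arguments.
# do_sample=True is required for temperature/top_p/top_k to take effect.
SAMPLING = {"do_sample": True, "temperature": 1.0, "top_p": 0.95, "top_k": 64}

messages = build_messages("Write a short joke about saving RAM.", thinking=True)
print(messages[0]["content"])

# Usage with an already loaded model and tokenized inputs:
# outputs = model.generate(**inputs, max_new_tokens=512, **SAMPLING)
```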

Limitations

This model inherits the limitations of the Gemma 4 base (factual accuracy, common-sense gaps, sensitivity to prompt quality, training-data biases) and additionally has its safety refusals removed. It will attempt to comply with prompts the original model would refuse, including harmful ones. Apply your own content-safety safeguards for any deployment.


Decensored with Heretic. Base weights © Google DeepMind, mirrored by Unsloth, used under the Gemma Terms.