Fast, lightweight Gemma 4 E2B coding agent for Claude Code, 64K context, native tool-calling, 100% GPU on 16GB Apple Silicon.

Applications

Claude Code ollama launch claude --model rafw007/gemma4-e2b-claude-coder

Codex App ollama launch codex-app --model rafw007/gemma4-e2b-claude-coder

OpenClaw ollama launch openclaw --model rafw007/gemma4-e2b-claude-coder

Hermes Agent ollama launch hermes --model rafw007/gemma4-e2b-claude-coder

Codex ollama launch codex --model rafw007/gemma4-e2b-claude-coder

OpenCode ollama launch opencode --model rafw007/gemma4-e2b-claude-coder

Gemma 4 Claude Coder — local model family

A family of custom models built on Gemma 4 (edge variants E2B and E4B), tuned to act as autonomous coding and administration agents. The models speak the Anthropic-compatible API, so they drive Claude Code fully locally — your code never leaves your machine and cloud token cost drops to zero.

Each model ships with a system prompt focused on real work inside a codebase: use tools instead of guessing, make minimal and precise code changes, return complete and runnable output, and verify after acting. Sampling follows Google’s official Gemma 4 recommendation (temperature 1.0, top_k 64, top_p 0.95), with thinking mode enabled for better planning before a tool call.

Models in the family

Model	Base	Context	Purpose
gemma4-e2b-claude-coder	Gemma 4 E2B (eff. 2B / 5.1B with embeddings)	64K	Fast everyday coding agent — edits, autocomplete, short agent loops. Lightest on memory.
gemma4-e4b-claude-coder	Gemma 4 E4B (eff. 4B / 8B with embeddings)	64K	Stronger coding agent — better reasoning and tool use on larger tasks.
gemma4-e4b-claude-coder-admin	Gemma 4 E4B	32K	Administration and system tasks (scripts, shell, devops). Smaller context fits 100% in GPU for higher, stable throughput.

What it’s for

Driving Claude Code locally (ollama launch claude --model <name>).
Agentic code writing and editing with native function calling / tool use.
Administration and devops tasks on a server (the admin variant).
Full privacy and offline operation — no code sent to the cloud.

Context

Coders (E2B / E4B): 64K tokens — matching Claude Code’s recommendation (64K minimum).
Admin (E4B): 32K tokens — a deliberate trade-off for 16 GB hardware that keeps the model entirely on the GPU.
Base Gemma 4 E2B/E4B natively supports up to 128K, so context can be raised on stronger hardware.

Test hardware

The models were built and tested on:

Mac Mini (Apple Silicon, M-series), 16 GB RAM, macOS 15.6
Ollama 0.24, GPU (Metal) inference

Measured performance (16 GB RAM)

Model	Placement	Speed	Tool calling
gemma4-e2b-claude-coder	100% GPU	~55 tok/s	✅ valid JSON
gemma4-e4b-claude-coder (64K)	39% GPU / 61% CPU	~27 tok/s (drops under load)	✅
gemma4-e4b-claude-coder-admin (32K)	100% GPU	~30 tok/s (stable)	✅

All three passed an end-to-end test through Claude Code: real turns with tool calls and correct responses (HTTP 200 on /v1/messages).

How they were made

These models were designed, built and tested with the help of Claude Opus 4.8 — the best coding model in the world. Their system prompts, parameter choices and context configuration draw directly on its knowledge. In other words: the world’s best coding model prepared local models that take that work over right on your desk.

License

Apache 2.0 (inherited from the base Gemma 4).