
A low-latency agentic router powered by the brand-new Gemma 4 E2B (Effective 2B) architecture. Optimized for the Monk AI framework to provide near-instant task delegation and "Edge-to-Business" tool-calling on Jetson hardware.

ollama run rubinmaximilian/Monk-Router-Gemma4e2b

Applications

  • Claude Code: ollama launch claude --model rubinmaximilian/Monk-Router-Gemma4e2b
  • Codex: ollama launch codex --model rubinmaximilian/Monk-Router-Gemma4e2b
  • OpenCode: ollama launch opencode --model rubinmaximilian/Monk-Router-Gemma4e2b
  • OpenClaw: ollama launch openclaw --model rubinmaximilian/Monk-Router-Gemma4e2b


Monk-Router-gemma4e2b

Monk-Router-gemma4e2b is a performance-first router designed for the Monk AI assistant. Built on Google’s 2026 E2B (Effective 2 Billion) architecture, it makes routing decisions faster than the companion Phi4-mini-based router, which suits it to edge computing environments.

This model is ideal for users who prioritize latency and VRAM efficiency on devices like the Jetson Orin Nano or MacBook Air. Please let me know if I should make an even larger model for scaled applications!

Performance

  • Low VRAM Footprint: Uses ~1.5GB of VRAM, allowing it to stay resident while worker models load.
  • Agentic Efficiency: Built on Gemma 4’s native tool-calling distillation for superior instruction following.
  • Speed: Optimized for “Time to First Token” (TTFT) in real-time assistant workflows.
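Staying resident while worker models load can be arranged with Ollama’s `keep_alive` request parameter; a minimal sketch of such a request payload (the endpoint shown in the comment and the prompt text are assumptions, adjust to your setup):

```python
import json

# Chat request that pins the router in memory indefinitely (keep_alive=-1),
# so it stays loaded while larger worker models are swapped in and out.
payload = {
    "model": "rubinmaximilian/Monk-Router-Gemma4e2b",
    "messages": [{"role": "user", "content": "Where should this task run?"}],
    "keep_alive": -1,  # -1 keeps the model loaded indefinitely; "10m" etc. also work
    "stream": False,
}

# POST this to the local Ollama server, e.g.:
#   curl http://localhost:11434/api/chat -d '<payload JSON>'
print(json.dumps(payload, indent=2))
```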

Logic Thresholds

  • Edge Route: Simple queries, small code snippets (<100 lines), and general chat.
  • GPU Route: High-VRAM requirements, multi-file analysis, and thermal-intensive tasks.
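The thresholds above can be sketched as a simple host-side pre-check (the `route` function, its parameters, and the exact 100-line cutoff are illustrative; in practice the model itself makes the decision):

```python
def route(query: str, code_lines: int = 0, needs_multi_file: bool = False) -> str:
    """Pick a route using the documented thresholds (illustrative sketch)."""
    if needs_multi_file or code_lines >= 100:
        return "gpu"   # high-VRAM, multi-file, or thermally intensive work
    return "edge"      # simple queries, small snippets (<100 lines), general chat

print(route("explain this snippet", code_lines=20))       # edge
print(route("refactor the repo", needs_multi_file=True))  # gpu
```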

Example Output

```json
{
  "logic": "General logic task. Keeping on local Jetson.",
  "tool_call": {
    "name": "switch_model",
    "parameters": {
      "model_name": "gemma4-e2b"
    }
  }
}
```
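A minimal sketch of how a host application might parse this output and act on the tool call (the `dispatch` helper is illustrative, assuming the JSON shape documented above):

```python
import json

# Raw router output, in the shape documented above.
router_output = """
{
  "logic": "General logic task. Keeping on local Jetson.",
  "tool_call": {
    "name": "switch_model",
    "parameters": {"model_name": "gemma4-e2b"}
  }
}
"""

def dispatch(raw: str) -> str:
    """Parse the router's JSON decision and return the worker model to load."""
    decision = json.loads(raw)
    call = decision["tool_call"]
    if call["name"] == "switch_model":
        return call["parameters"]["model_name"]
    raise ValueError(f"Unknown tool: {call['name']}")

print(dispatch(router_output))  # gemma4-e2b
```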