128 1 month ago

A lightweight, hardware-aware router built on Gemma 4 E2B. It acts as a dispatcher for local AI setups, automatically deciding whether a prompt should run on edge hardware (like a Jetson Nano), a local GPU, or the cloud based on task complexity.

vision tools thinking audio
ollama run rubinmaximilian/Monk-Router-Gemma4e2b

Details

86572402fe95 · 7.2GB · gemma4 · 5.12B · Q4_K_M
Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR US
You are the Monk AI Logic Router. Your ONLY purpose is to output valid JSON. DO NOT provide explanat
{ "num_ctx": 2048, "stop": [ "<|turn|>", "<end_of_turn>" ], "tempera
[{"role":"user","content":"Can you quickly draft an email to my boss about the meeting?"},{"role":"a

Readme

Monk-Router-Gemma4e2b

Monk-Router is a low-latency routing model designed to manage hardware constraints in local AI setups. I built this to solve a specific problem: keeping simple tasks fast and local on edge devices, while automatically offloading heavy code analysis to larger servers.

Built on the Gemma 4 E2B architecture, the model uses roughly 1.5GB of VRAM (Q4). This allows it to stay resident in memory on smaller devices like the Jetson Orin Nano or older MacBooks without causing out-of-memory errors when the actual worker models load.

How It Works

This model does not generate conversational responses. It is strictly a stateless JSON dispatcher, designed to be paired with a back-end script (for example, in Python) that handles the actual execution.

Instead of hard-coding specific models or server destinations (which breaks when you run the router on a different machine and limits customization), the router expects the back-end to pass a list of the currently available hardware and models. It then routes the user's prompt to the most logical destination.

The Routing Logic

The model evaluates the prompt and routes it based on three categories:

1. Hardware Tiers (set_server)
- tier_1_edge: Simple tasks or fast queries (keeps the task local).
- tier_2_main: Heavy logic, or analyzing large files of more than 100 lines (offloads to a main PC/GPU).
- tier_3_cloud: Extremely large context requirements or fallback APIs.

2. Model Capabilities (switch_model)
- Maps the task to the right tool: code_small, code_big, writing, or general_reasoning.

3. Multi-Model Workflows (activate_swarm)
- Triggers custom back-end workflows if the task requires a multi-step review (e.g., cybersec_tester or code_review).
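On the back-end side, the three tool names can be mapped to handlers with a small dispatch table. The following is a minimal sketch, not the author's implementation. It assumes the router's reply carries a `tool_call` object with `name` and `parameters`, as in the usage example. The `tier` parameter comes from that example; the `capability` and `workflow` parameter names, and the handler bodies, are hypothetical placeholders.

```python
from typing import Any, Callable, Dict


def set_server(params: Dict[str, Any]) -> str:
    # "tier" is the parameter name shown in the README's example output.
    return f"server -> {params['tier']}"


def switch_model(params: Dict[str, Any]) -> str:
    # ASSUMPTION: "capability" is a hypothetical parameter name.
    return f"model -> {params['capability']}"


def activate_swarm(params: Dict[str, Any]) -> str:
    # ASSUMPTION: "workflow" is a hypothetical parameter name.
    return f"swarm -> {params['workflow']}"


# One handler per tool name the router can emit.
HANDLERS: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "set_server": set_server,
    "switch_model": switch_model,
    "activate_swarm": activate_swarm,
}


def dispatch(decision: Dict[str, Any]) -> str:
    """Look up the handler for the router's tool call and run it."""
    call = decision["tool_call"]
    handler = HANDLERS.get(call["name"])
    if handler is None:
        raise ValueError(f"unknown tool: {call['name']!r}")
    return handler(call.get("parameters", {}))
```

A real back-end would replace the string-returning handlers with code that actually selects a server, loads a model, or starts a workflow.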

Usage Example

1. What the Python backend sends to the router:

```text
AVAILABLE RESOURCES:
- Capabilities: ['code_small', 'code_big', 'general_reasoning']
- Server Tiers: ['tier_1_edge', 'tier_2_main']
USER REQUEST: "Analyze this 2,000 line C++ file."
```
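A back-end could assemble that message from whatever resources it detects at runtime. This is a rough sketch: the function name `build_router_prompt` is hypothetical, and the exact line breaks are an assumption, since only the field labels are given here.

```python
from typing import List


def build_router_prompt(capabilities: List[str], tiers: List[str], request: str) -> str:
    """Format the available resources and the user request for the router.

    ASSUMPTION: one field per line; Python's list repr supplies the
    ['a', 'b'] style used in the README example.
    """
    return (
        "AVAILABLE RESOURCES:\n"
        f"- Capabilities: {capabilities}\n"
        f"- Server Tiers: {tiers}\n"
        f'USER REQUEST: "{request}"'
    )


if __name__ == "__main__":
    print(
        build_router_prompt(
            ["code_small", "code_big", "general_reasoning"],
            ["tier_1_edge", "tier_2_main"],
            "Analyze this 2,000 line C++ file.",
        )
    )
```

Because the lists are passed in rather than hard-coded, the same script can run unchanged on machines with different hardware.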

Example Output:

{
  "logic": "Massive codebase analysis exceeds edge capacity.",
  "tool_call": {
    "name": "set_server",
    "parameters": { "tier": "tier_2_main" }
  }
}
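Since the back-end acts on this JSON directly, it is worth validating the reply before executing anything. The sketch below is one possible check, under the assumption that a `set_server` call should only name a tier the back-end actually offered. The function name `parse_router_reply` is hypothetical.

```python
import json
from typing import Any, Dict, List


def parse_router_reply(raw: str, available_tiers: List[str]) -> Dict[str, Any]:
    """Parse the router's JSON reply and reject unusable decisions."""
    # json.JSONDecodeError is a subclass of ValueError, so callers can
    # catch ValueError for both malformed JSON and failed checks.
    decision = json.loads(raw)
    if not isinstance(decision, dict):
        raise ValueError("reply is not a JSON object")

    call = decision.get("tool_call")
    if not isinstance(call, dict) or "name" not in call:
        raise ValueError("reply has no usable tool_call")

    params = call.get("parameters", {})
    # ASSUMPTION: a tier that was not offered is treated as an error.
    if call["name"] == "set_server" and params.get("tier") not in available_tiers:
        raise ValueError(f"router chose unavailable tier: {params.get('tier')!r}")
    return decision


if __name__ == "__main__":
    reply = (
        '{"logic": "Massive codebase analysis exceeds edge capacity.", '
        '"tool_call": {"name": "set_server", "parameters": {"tier": "tier_2_main"}}}'
    )
    decision = parse_router_reply(reply, ["tier_1_edge", "tier_2_main"])
    print(decision["tool_call"]["parameters"]["tier"])
```

On a `ValueError`, a back-end might retry the router or fall back to a default tier; which is appropriate depends on the setup.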