128 Downloads Updated 1 month ago
ollama run rubinmaximilian/Monk-Router-Gemma4e2b
Updated 1 month ago
1 month ago
86572402fe95 · 7.2GB ·
Monk-Router is a low-latency routing model designed to manage hardware constraints in local AI setups. I built this to solve a specific problem: keeping simple tasks fast and local on edge devices, while automatically offloading heavy code analysis to larger servers.
Built on the Gemma 4 E2B architecture, the model uses roughly 1.5GB of VRAM (Q4). This allows it to stay resident in memory on smaller devices like the Jetson Orin Nano or older MacBooks without causing out-of-memory errors when the actual worker models load.
This model does not generate conversational responses. It is strictly a stateless JSON dispatcher. It is designed to be paired with a back-end script (like Python) that handles the actual execution.
Instead of hard-coding specific models or server destination(which breaks if you run this on a different machine and limits overall customization and usability), the router expects the back-end to pass a list of currently available hardware and models. It then routes the user’s prompt to the most logical destination.
The model evaluates the prompt and routes it based on three categories:
1. Hardware Tiers (set_server)
- tier_1_edge: Simple tasks or fast queries (keeps the task local).
- tier_2_main: Heavy logic, or analyzing large files >100 lines (offloads to a main PC/GPU).
- tier_3_cloud: Extremely large context requirements or fallback APIs.
2. Model Capabilities (switch_model)
- Maps the task to the right tool: code_small, code_big, writing, or general_reasoning.
3. Multi-Model Workflows (activate_swarm)
- Triggers custom back-end workflows if the task requires a multi-step review (e.g., cybersec_tester or code_review).
1. What the Python backend sends to the router: “`text AVAILABLE RESOURCES: - Capabilities: [‘code_small’, ‘code_big’, ‘general_reasoning’] - Server Tiers: [‘tier_1_edge’, ‘tier_2_main’] USER REQUEST: “Analyze this 2,000 line C++ file.”
{ “logic”: “Massive codebase analysis exceeds edge capacity.”, “tool_call”: { “name”: “set_server”, “parameters”: { “tier”: “tier_2_main” } } }