
A low-latency agentic router powered by the brand-new Gemma 4 E2B (Effective 2B) architecture. Optimized for the Monk AI framework to provide near-instant task delegation and "Edge-to-Business" tool-calling on Jetson hardware.

ollama run rubinmaximilian/Monk-Router-Gemma4e2b

Applications

  • Claude Code: ollama launch claude --model rubinmaximilian/Monk-Router-Gemma4e2b
  • Codex: ollama launch codex --model rubinmaximilian/Monk-Router-Gemma4e2b
  • OpenCode: ollama launch opencode --model rubinmaximilian/Monk-Router-Gemma4e2b
  • OpenClaw: ollama launch openclaw --model rubinmaximilian/Monk-Router-Gemma4e2b


Monk-Router-gemma4e2b

Monk-Router-gemma4e2b is a performance-first router designed for the Monk AI assistant. Built on Google’s 2026 E2B (Effective 2 Billion) architecture, it makes routing decisions faster than the companion Phi4-mini-based router, which suits it to edge computing environments.

This model is ideal for users who prioritize latency and VRAM efficiency on devices like the Jetson Orin Nano or MacBook Air. Please let me know if I should make an even larger model for scaled applications!

Performance

  • Low VRAM Footprint: Uses ~1.5GB of VRAM, allowing it to stay resident while worker models load.
  • Agentic Efficiency: Built on Gemma 4’s native tool-calling distillation for superior instruction following.
  • Speed: Optimized for “Time to First Token” (TTFT) in real-time assistant workflows.
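Staying resident while worker models load can be arranged with Ollama’s `keep_alive` request parameter; a minimal sketch of such a request payload (the endpoint shown in the comment and the prompt text are assumptions, adjust to your setup):

```python
import json

# Chat request that pins the router in memory indefinitely (keep_alive=-1),
# so it stays loaded while larger worker models are swapped in and out.
payload = {
    "model": "rubinmaximilian/Monk-Router-Gemma4e2b",
    "messages": [{"role": "user", "content": "Where should this task run?"}],
    "keep_alive": -1,  # -1 keeps the model loaded indefinitely; "10m" etc. also work
    "stream": False,
}

# POST this to the local Ollama server, e.g.:
#   curl http://localhost:11434/api/chat -d '<payload JSON>'
print(json.dumps(payload, indent=2))
```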

Logic Thresholds

  • Edge Route: Simple queries, small code snippets (<100 lines), and general chat.
  • GPU Route: High-VRAM requirements, multi-file analysis, and thermal-intensive tasks.
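The thresholds above can be sketched as a simple host-side pre-check (the `route` function, its parameters, and the exact 100-line cutoff are illustrative; in practice the model itself makes the decision):

```python
def route(query: str, code_lines: int = 0, needs_multi_file: bool = False) -> str:
    """Pick a route using the documented thresholds (illustrative sketch)."""
    if needs_multi_file or code_lines >= 100:
        return "gpu"   # high-VRAM, multi-file, or thermally intensive work
    return "edge"      # simple queries, small snippets (<100 lines), general chat

print(route("explain this snippet", code_lines=20))       # edge
print(route("refactor the repo", needs_multi_file=True))  # gpu
```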

Example Output

```json
{
  "logic": "General logic task. Keeping on local Jetson.",
  "tool_call": {
    "name": "switch_model",
    "parameters": {
      "model_name": "gemma4-e2b"
    }
  }
}
```
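A minimal sketch of how a host application might parse this output and act on the tool call (the `dispatch` helper is illustrative, assuming the JSON shape documented above):

```python
import json

# Raw router output, in the shape documented above.
router_output = """
{
  "logic": "General logic task. Keeping on local Jetson.",
  "tool_call": {
    "name": "switch_model",
    "parameters": {"model_name": "gemma4-e2b"}
  }
}
"""

def dispatch(raw: str) -> str:
    """Parse the router's JSON decision and return the worker model to load."""
    decision = json.loads(raw)
    call = decision["tool_call"]
    if call["name"] == "switch_model":
        return call["parameters"]["model_name"]
    raise ValueError(f"Unknown tool: {call['name']}")

print(dispatch(router_output))  # gemma4-e2b
```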