
A text-only variant of Google's multimodal gemma-4-E4B-it model, produced by structural extraction. The vision and audio encoders have been fully decoupled, which reduces the VRAM footprint for text-centric workloads. A system prompt is included to address the abilities lost in the process.

Capabilities: tools, thinking
{
  "stop": [
    "<end_of_turn>"
  ]
}
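
The parameter block above declares a single stop sequence, `<end_of_turn>`. As a rough illustration of what that means, the sketch below parses the same JSON and truncates a piece of text at the first stop sequence it finds. The helper `apply_stop` and the sample string `raw` are hypothetical and exist only for this example. A real runtime would normally halt generation when the sequence appears rather than trim the text afterwards, so treat this as an approximation of the effect, not of the mechanism.

```python
import json

# The parameter block from this page, parsed as JSON.
PARAMS = json.loads('{"stop": ["<end_of_turn>"]}')


def apply_stop(text: str, stops: list[str]) -> str:
    """Return text cut at the earliest stop sequence (hypothetical helper)."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            # Keep the earliest position at which any stop sequence begins.
            cut = min(cut, idx)
    return text[:cut]


# Made-up raw output, used only to show the truncation.
raw = "Paris is the capital of France.<end_of_turn>\n<start_of_turn>user\n"
print(apply_stop(raw, PARAMS["stop"]))  # prints: Paris is the capital of France.
```

Text that contains no stop sequence passes through unchanged, and when several stop sequences are configured, the one that occurs earliest in the text wins.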