
Ultra-compact 256M-parameter vision-language model for image and video understanding. Supports visual question answering, captioning, OCR, and video analysis. Needs only 1.38 GB of VRAM. Built on SigLIP and SmolLM2. Available in Q8 and FP16 variants. Apache 2.0 license.

vision
dbfd8e8c08ef · 57B
{
  "num_ctx": 4096,
  "stop": [
    "<end_of_utterance>"
  ]
}
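
The parameter block above sets a 4096-token context window and a single stop sequence. As a rough sketch of how these values might be passed per request, the Python below builds a request body in the shape used by Ollama's `POST /api/generate` endpoint. It assumes the model is served by an Ollama-style runtime, which this page does not state outright. The model tag `smolvlm-example` is a hypothetical placeholder, and `build_generate_request` and `trim_at_stop` are illustrative helpers, not part of any library. No network call is made.

```python
import base64

# Values copied from the parameter block above.
MODEL_PARAMS = {"num_ctx": 4096, "stop": ["<end_of_utterance>"]}


def build_generate_request(model_tag, prompt, image_bytes=None, params=None):
    """Build a JSON-serializable body for an Ollama-style /api/generate call.

    model_tag is whatever name the model was pulled under (an assumption here).
    """
    body = {
        "model": model_tag,
        "prompt": prompt,
        "stream": False,
        # Per-request options override the defaults stored with the model.
        "options": dict(params if params is not None else MODEL_PARAMS),
    }
    if image_bytes is not None:
        # Images are sent as base64-encoded strings.
        body["images"] = [base64.b64encode(image_bytes).decode("ascii")]
    return body


def trim_at_stop(text, stops=None):
    """Cut text at the earliest stop sequence, mimicking what "stop" does."""
    stops = MODEL_PARAMS["stop"] if stops is None else stops
    cut = len(text)
    for s in stops:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


if __name__ == "__main__":
    # "smolvlm-example" is a hypothetical tag used only for illustration.
    req = build_generate_request("smolvlm-example", "Describe this image.", b"abc")
    print(req["options"])
    print(trim_at_stop("A cat on a sofa.<end_of_utterance>User:"))
```

The `trim_at_stop` helper only illustrates the effect of the stop sequence: the runtime is expected to end generation at `<end_of_utterance>` on its own, so client-side trimming would normally be unnecessary.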