
Compact 500M-parameter vision-language model for image and video understanding. Supports visual question answering, captioning, OCR, and video analysis. Needs only 1.8 GB of VRAM. Built on the SigLIP vision encoder and the SmolLM2 language model. Available in Q8 and FP16 variants. Apache 2.0 license.
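As a rough sanity check on the memory figure, the weight storage alone can be estimated from the parameter count and the precision. The sketch below is back-of-the-envelope arithmetic, not a measurement: the helper name `weight_memory_gb` is hypothetical, Q8 is treated as exactly 8 bits per weight (real Q8 formats also store scale factors, so they come out slightly larger), and the gap up to the quoted 1.8 GB is assumed to cover activations, cache, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Estimate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 500e6  # "500M" from the description above

fp16_gb = weight_memory_gb(PARAMS, 16)  # 2 bytes per weight
q8_gb = weight_memory_gb(PARAMS, 8)     # assumed 1 byte per weight

print(f"FP16 weights: {fp16_gb:.1f} GB")  # FP16 weights: 1.0 GB
print(f"Q8 weights:   {q8_gb:.1f} GB")    # Q8 weights:   0.5 GB
```

Under these assumptions the FP16 weights take about 1.0 GB and the Q8 weights about 0.5 GB, so either variant leaves headroom inside the stated 1.8 GB.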

vision
c766c6ee9564 · 88B
You are SmolVLM2, a helpful AI assistant specialized in understanding images and videos.
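For illustration, here is a minimal sketch of how an image question might be packaged together with this system prompt. It assumes the model is served through an Ollama-style `/api/generate` endpoint, which the listing itself does not state; the model tag `smolvlm2-example` and the helper `build_generate_payload` are hypothetical placeholders. The code only builds the JSON body and sends nothing over the network.

```python
import base64
import json

# System prompt copied from the listing above.
SYSTEM_PROMPT = (
    "You are SmolVLM2, a helpful AI assistant specialized in "
    "understanding images and videos."
)


def build_generate_payload(image_bytes: bytes, prompt: str,
                           model: str = "smolvlm2-example") -> dict:
    """Build a request body for an Ollama-style /api/generate call.

    The model tag is a hypothetical placeholder; substitute the tag
    under which the model is actually installed.
    """
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": prompt,
        # Images are passed as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


# Placeholder bytes stand in for a real image file's contents.
payload = build_generate_payload(b"\x89PNG placeholder", "What text appears in this image?")
body = json.dumps(payload)
```

A runtime that stores the system prompt with the model would normally apply it automatically, so the explicit `system` field is shown here only to make the connection visible.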