Compact 500M vision-language model for video/image understanding. Supports visual QA, captioning, OCR, video analysis. Only 1.8GB VRAM. Built on SigLIP + SmolLM2. Available in Q8 and FP16. Apache 2.0 license.

SmolVLM2-500M-Video-Instruct

Compact 500M parameter vision-language model optimized for video and image understanding. Requires only 1.8GB VRAM for inference.

Available Variants

  • latest / q8 — Q8_0 quantization, ~546MB (default)
  • fp16 — F16 full precision, ~1GB

Capabilities

  • Video analysis and captioning
  • Image description and visual QA
  • OCR and text extraction
  • Document understanding
  • Multi-image comparison

Usage

# Default (Q8) — the image path goes inside the prompt; Ollama detects it and attaches the file
ollama run ahmadwaqar/smolvlm2-500m-video "Describe this image ./photo.jpg"

# FP16 (higher quality)
ollama run ahmadwaqar/smolvlm2-500m-video:fp16 "Describe this image ./photo.jpg"

Python

import ollama

response = ollama.chat(
    model='ahmadwaqar/smolvlm2-500m-video',  # or :fp16
    messages=[{
        'role': 'user',
        'content': 'Describe this image',
        'images': ['./image.jpg']  # local file path; the client reads and base64-encodes it
    }]
)
print(response['message']['content'])
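
Video and multi-image

The 'images' field is a list, so multi-image comparison is simply several entries in one message. Video is not a documented input for the Ollama endpoints, so a practical workaround is to sample frames and send them as multiple images. The sketch below does that with OpenCV (cv2) — an assumption on my part, not something this model card ships — and it also assumes the Python client accepts raw JPEG bytes in 'images' alongside file paths; verify how well the model stitches sampled frames into one description before relying on it.

import cv2  # assumed dependency: pip install opencv-python
import ollama

def sample_frames(video_path, every_n=30, limit=8):
    """Grab up to `limit` JPEG-encoded frames, one every `every_n` frames."""
    cap = cv2.VideoCapture(video_path)
    frames, i = [], 0
    while len(frames) < limit:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            ok, buf = cv2.imencode('.jpg', frame)
            if ok:
                frames.append(buf.tobytes())
        i += 1
    cap.release()
    return frames

# Multi-image input: several paths, or here, several frames from one clip
response = ollama.chat(
    model='ahmadwaqar/smolvlm2-500m-video',
    messages=[{
        'role': 'user',
        'content': 'Describe what happens across these frames',
        'images': sample_frames('./clip.mp4')
    }]
)
print(response['message']['content'])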

API

IMG=$(base64 < image.jpg | tr -d '\n')

curl http://localhost:11434/api/chat -d '{
  "model": "ahmadwaqar/smolvlm2-500m-video",
  "messages": [{
    "role": "user",
    "content": "What is in this image?",
    "images": ["'"$IMG"'"]
  }]
}'
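
The chat endpoint streams newline-delimited JSON by default; add "stream": false to get a single response object. A rough Python equivalent of the curl call above, using the requests package (an assumption here — any HTTP client works):

import base64
import requests  # assumed installed: pip install requests

with open('image.jpg', 'rb') as f:
    img = base64.b64encode(f.read()).decode()

r = requests.post('http://localhost:11434/api/chat', json={
    'model': 'ahmadwaqar/smolvlm2-500m-video',
    'stream': False,  # one JSON object instead of a stream of chunks
    'messages': [{
        'role': 'user',
        'content': 'What is in this image?',
        'images': [img]  # base64 string, same as the curl example
    }]
})
print(r.json()['message']['content'])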

Specs

  • Parameters: 500M
  • Variants: Q8_0 (default), FP16
  • VRAM: ~1.8GB
  • Context: 4096 tokens
  • Architecture: SigLIP + SmolLM2
  • License: Apache 2.0
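
The 4096-token context is the model's listed window; generation options such as num_ctx and temperature can be set per request via the options field. A minimal sketch with the Python client — the file name and option values are illustrative, not recommendations from this card:

import ollama

response = ollama.chat(
    model='ahmadwaqar/smolvlm2-500m-video',
    messages=[{
        'role': 'user',
        'content': 'Read the text in this image',
        'images': ['./scan.png']  # hypothetical local file
    }],
    options={'num_ctx': 4096, 'temperature': 0}  # full context window, near-deterministic decoding
)
print(response['message']['content'])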

Benchmarks

  • Video-MME: 42.2
  • MLVU: 47.3
  • MVBench: 39.73

License

Apache 2.0