Ultra-compact 256M vision-language model for video/image understanding. Supports visual QA, captioning, OCR, video analysis. Only 1.38GB VRAM. Built on SigLIP + SmolLM2. Available in Q8 and FP16. Apache 2.0 license.

vision
ollama run ahmadwaqar/smolvlm2-256m-video:fp16

Details

ae2d9ecd464d · 518MB

llama · 163M · F16
clip · 93.5M · F16
<|im_start|>{{ if .System }}System: {{ .System }}<end_of_utterance> {{ end }}User: {{ .Prompt }}<end
{ "num_ctx": 4096, "stop": [ "<end_of_utterance>" ] }

Readme

SmolVLM2-256M-Video-Instruct

Ultra-compact 256M-parameter vision-language model optimized for video and image understanding. Inference requires only about 1.38GB of VRAM. Billed as the smallest video language model released to date.

Available Variants

  • latest / q8: Q8_0 quantization, ~175MB (default)
  • fp16: F16 half precision (unquantized), ~328MB
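
The listed sizes can be sanity-checked against the parameter counts shown under Details (163M for the llama component, 93.5M for the clip component). The following is a rough sketch, assuming 2 bytes per weight for F16 and 34 bytes per block of 32 weights for Q8_0 (the GGML block layout), and ignoring file metadata and any tensors kept at higher precision. The helper name `estimate_mb` is illustrative only.

```python
# Rough size estimates from parameter counts. Real files also carry
# metadata and may keep some tensors at higher precision.
LLAMA_PARAMS = 163e6   # language model, from the Details section
CLIP_PARAMS = 93.5e6   # vision component, from the Details section

F16_BYTES_PER_WEIGHT = 2.0
Q8_0_BYTES_PER_WEIGHT = 34 / 32  # 32 int8 weights + one 2-byte scale per block


def estimate_mb(params: float, bytes_per_weight: float) -> float:
    """Return the estimated size in megabytes (1 MB = 10**6 bytes)."""
    return params * bytes_per_weight / 1e6


llama_f16_mb = estimate_mb(LLAMA_PARAMS, F16_BYTES_PER_WEIGHT)
llama_q8_mb = estimate_mb(LLAMA_PARAMS, Q8_0_BYTES_PER_WEIGHT)
total_f16_mb = estimate_mb(LLAMA_PARAMS + CLIP_PARAMS, F16_BYTES_PER_WEIGHT)

print(f"llama F16:        ~{llama_f16_mb:.0f} MB")
print(f"llama Q8_0:       ~{llama_q8_mb:.0f} MB")
print(f"llama + clip F16: ~{total_f16_mb:.0f} MB")
```

Under these assumptions the language model alone comes to roughly 326 MB in F16 and 173 MB in Q8_0, close to the ~328MB and ~175MB figures above, while both components together in F16 come to roughly 513 MB, close to the 518MB shown for the fp16 tag. This suggests the variant sizes above may refer to the language model weights only.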

Capabilities

  • Video analysis and captioning
  • Image description and visual QA
  • OCR and text extraction
  • Document understanding
  • Multi-image comparison

Usage

# Default (Q8)
ollama run ahmadwaqar/smolvlm2-256m-video "Describe this image" ./photo.jpg

# FP16 (higher quality)
ollama run ahmadwaqar/smolvlm2-256m-video:fp16 "Describe this image" ./photo.jpg

Python

import ollama

response = ollama.chat(
    model='ahmadwaqar/smolvlm2-256m-video',  # or :fp16
    messages=[{
        'role': 'user',
        'content': 'Describe this image',
        'images': ['./image.jpg']
    }]
)
print(response['message']['content'])
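
The model's default parameters (`num_ctx` and the `<end_of_utterance>` stop token shown under Details) can also be set per request through the `options` argument of `ollama.chat`. The sketch below only assembles the arguments, so it runs without a server; `build_chat_request` is a hypothetical helper, not part of the `ollama` package, and the default values simply mirror the ones listed above.

```python
from typing import Any


def build_chat_request(prompt: str, image_path: str, *, fp16: bool = False,
                       num_ctx: int = 4096) -> dict[str, Any]:
    """Assemble keyword arguments for ollama.chat (hypothetical helper)."""
    model = 'ahmadwaqar/smolvlm2-256m-video'
    if fp16:
        model += ':fp16'
    return {
        'model': model,
        'messages': [{
            'role': 'user',
            'content': prompt,
            'images': [image_path],
        }],
        # Per-request overrides of the model's default parameters.
        'options': {'num_ctx': num_ctx, 'stop': ['<end_of_utterance>']},
    }


request = build_chat_request('Describe this image', './image.jpg', fp16=True)
print(request['model'])

# With the Ollama server running and the `ollama` package installed:
#   import ollama
#   response = ollama.chat(**request)
#   print(response['message']['content'])
```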

API

IMG=$(base64 < image.jpg | tr -d '\n')

curl http://localhost:11434/api/chat -d '{
  "model": "ahmadwaqar/smolvlm2-256m-video",
  "messages": [{
    "role": "user",
    "content": "What is in this image?",
    "images": ["'"$IMG"'"]
  }]
}'
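
The same request can be made from Python using only the standard library. This is a minimal sketch, assuming the server listens on the default `localhost:11434` as in the curl call above; it sets `"stream": false` so the reply arrives as a single JSON object rather than a stream of chunks. The function names `build_payload` and `ask` are illustrative.

```python
import base64
import json
import urllib.request

OLLAMA_URL = 'http://localhost:11434/api/chat'


def build_payload(image_bytes: bytes, prompt: str,
                  model: str = 'ahmadwaqar/smolvlm2-256m-video') -> dict:
    """Build the /api/chat request body with the image base64-encoded."""
    return {
        'model': model,
        'stream': False,  # return one JSON object instead of a stream
        'messages': [{
            'role': 'user',
            'content': prompt,
            'images': [base64.b64encode(image_bytes).decode('ascii')],
        }],
    }


def ask(image_path: str, prompt: str) -> str:
    """Send one image and a prompt to the local server, return the reply text."""
    with open(image_path, 'rb') as f:
        payload = build_payload(f.read(), prompt)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)['message']['content']


# Example (requires a running Ollama server and a local image.jpg):
#   print(ask('image.jpg', 'What is in this image?'))
```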

Specs

  • Parameters: 256M
  • Variants: Q8_0 (default), FP16
  • VRAM: ~1.38GB
  • Context: 2048 tokens
  • Architecture: SigLIP + SmolLM2
  • License: Apache 2.0

Benchmarks

  • Video-MME: 33.7
  • MLVU: 40.6
  • MVBench: 32.7

License

Apache 2.0


Summary Comparison Table

Spec         256M Model     500M Model
Parameters   256M           500M
VRAM         ~1.38GB        ~1.8GB
Q8 Size      ~175MB         ~546MB
FP16 Size    ~328MB         ~1GB
Context      2048 tokens    4096 tokens
Video-MME    33.7           42.2
MLVU         40.6           47.3
MVBench      32.7           39.73
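
To put the benchmark rows in perspective, the gap between the two models can be computed directly from the table. This is a small illustrative sketch using only the scores listed above; the `score_gaps` helper is hypothetical.

```python
# Benchmark scores copied from the table: (256M model, 500M model).
scores = {
    'Video-MME': (33.7, 42.2),
    'MLVU': (40.6, 47.3),
    'MVBench': (32.7, 39.73),
}


def score_gaps(table: dict) -> dict:
    """Return the 500M-minus-256M score difference per benchmark."""
    return {name: round(large - small, 2)
            for name, (small, large) in table.items()}


gaps = score_gaps(scores)
for name, gap in gaps.items():
    print(f"{name}: 500M model scores {gap} points higher")
```

On these three benchmarks the 500M model leads by roughly 7 to 8.5 points, at the cost of the larger download and VRAM figures shown in the same table.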