Ultra-compact 256M vision-language model for video/image understanding. Supports visual QA, captioning, OCR, video analysis. Only 1.38GB VRAM. Built on SigLIP + SmolLM2. Available in Q8 and FP16. Apache 2.0 license.

Details

Updated 6 months ago

ae2d9ecd464d · 518MB

model
arch: llama
parameters: 163M
quantization: F16
size: 328MB

projector
arch: clip
parameters: 93.5M
quantization: F16
size: 190MB

template (160B)
<|im_start|>{{ if .System }}System: {{ .System }}<end_of_utterance> {{ end }}User: {{ .Prompt }}<end

params (57B)
{ "num_ctx": 4096, "stop": [ "<end_of_utterance>" ] }

SmolVLM2-256M-Video-Instruct

Ultra-compact 256M-parameter vision-language model optimized for video and image understanding. Inference requires about 1.38GB of VRAM. Billed as the smallest video language model released to date.

Available Variants

latest / q8: Q8_0 quantization, ~175MB (default)
fp16: F16 half precision (unquantized), ~328MB
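
The listed sizes can be sanity-checked from the parameter counts. The sketch below is a rough estimate only. It assumes the 163M language-model and 93.5M projector parameter counts shown in the model details, and the usual GGUF storage costs: 16 bits per weight for F16, and 8.5 bits per weight for Q8_0 (blocks of 32 int8 weights plus one 16-bit scale). It ignores file metadata and any tensors kept at a different precision.

```python
# Rough GGUF weight-size estimate; ignores metadata and mixed-precision tensors.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,  # 32 int8 weights + one fp16 scale per block = 34 bytes / 32 weights
}


def estimate_size_mb(n_params: float, quant: str) -> float:
    """Return the approximate weight size in MB (10^6 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e6


LM_PARAMS = 163e6          # language model, from the model details
PROJECTOR_PARAMS = 93.5e6  # vision projector, from the model details

for name, params, quant in [
    ("language model", LM_PARAMS, "F16"),
    ("language model", LM_PARAMS, "Q8_0"),
    ("projector", PROJECTOR_PARAMS, "F16"),
]:
    print(f"{name} {quant}: ~{estimate_size_mb(params, quant):.0f} MB")
```

The estimates (about 326MB, 173MB and 187MB) land close to the listed 328MB, ~175MB and 190MB, which suggests the variant sizes above refer to the language-model weights alone, with the projector stored as a separate file.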

Capabilities

Video analysis and captioning
Image description and visual QA
OCR and text extraction
Document understanding
Multi-image comparison
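
Ollama's chat API takes still images, so one way to approach video analysis is to sample a handful of frames and send them together as a multi-image message. The sketch below only picks evenly spaced frame indices and builds the message. `sample_frame_indices` and `build_video_message` are hypothetical helpers, the frame file names are made up for illustration, and frame extraction itself (for example with ffmpeg or OpenCV) is left out. How many frames this model handles well through Ollama is an assumption to verify.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick up to num_samples evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    # Take the midpoint of each of num_samples equal-width segments.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def build_video_message(frame_paths: list[str], prompt: str) -> dict:
    """Build one user message carrying all sampled frames (hypothetical helper)."""
    return {"role": "user", "content": prompt, "images": list(frame_paths)}


indices = sample_frame_indices(total_frames=300, num_samples=4)
# Hypothetical file naming for frames already extracted to disk.
paths = [f"./frames/frame_{i:05d}.jpg" for i in indices]
message = build_video_message(paths, "Describe what happens across these frames")
print(indices)  # [37, 112, 187, 262]
```

The resulting `message` can go into the `messages` list of an `ollama.chat` call, in the same way as the single-image message in the Python usage example.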

Usage

# Default (Q8)
ollama run ahmadwaqar/smolvlm2-256m-video "Describe this image" ./photo.jpg

# FP16 (higher quality)
ollama run ahmadwaqar/smolvlm2-256m-video:fp16 "Describe this image" ./photo.jpg

Python

import ollama

response = ollama.chat(
    model='ahmadwaqar/smolvlm2-256m-video',  # or :fp16
    messages=[{
        'role': 'user',
        'content': 'Describe this image',
        'images': ['./image.jpg']
    }]
)
print(response['message']['content'])

API

IMG=$(base64 < image.jpg | tr -d '\n')

curl http://localhost:11434/api/chat -d '{
  "model": "ahmadwaqar/smolvlm2-256m-video",
  "messages": [{
    "role": "user",
    "content": "What is in this image?",
    "images": ["'"$IMG"'"]
  }]
}'
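
The same request can be assembled in Python without the `ollama` package. This is a minimal sketch using only the standard library. `build_chat_payload` and `send_chat` are hypothetical helpers, the endpoint is the `localhost:11434` address used in the curl call, and `"stream": False` is an addition to the curl example so that the server returns one JSON object instead of a stream of chunks.

```python
import base64
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(image_bytes: bytes, prompt: str,
                       model: str = "ahmadwaqar/smolvlm2-256m-video") -> dict:
    """Build the /api/chat request body with one base64-encoded image."""
    return {
        "model": model,
        "stream": False,  # ask for a single JSON response rather than a stream
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
    }


def send_chat(payload: dict, url: str = OLLAMA_CHAT_URL) -> str:
    """POST the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["message"]["content"]


# Usage (needs a running Ollama server and a local image.jpg):
#   with open("image.jpg", "rb") as f:
#       print(send_chat(build_chat_payload(f.read(), "What is in this image?")))
```

Keeping payload construction separate from the network call makes the request body easy to inspect or log before anything is sent.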

Specs

Parameters: 256M
Variants: Q8_0 (default), FP16
VRAM: ~1.38GB
Context: 2048 tokens
Architecture: SigLIP + SmolLM2
License: Apache 2.0
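
The context length and stop sequence do not have to come from the packaged params: the `ollama` Python client accepts per-request overrides through its `options` argument. The sketch below assumes the `ollama` package is installed and a server is running. `build_chat_kwargs` and `describe` are hypothetical helpers, and the default of 2048 simply mirrors the context figure listed above, not a tuned recommendation.

```python
def build_chat_kwargs(prompt: str, image_path: str, num_ctx: int = 2048,
                      model: str = "ahmadwaqar/smolvlm2-256m-video") -> dict:
    """Assemble keyword arguments for ollama.chat with explicit runtime options."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [image_path],
        }],
        # Per-request overrides of the model's packaged parameters.
        "options": {"num_ctx": num_ctx, "stop": ["<end_of_utterance>"]},
    }


def describe(image_path: str, prompt: str = "Describe this image") -> str:
    """Send the request; needs the ollama package and a running server."""
    import ollama  # imported lazily so the helper above works without it

    response = ollama.chat(**build_chat_kwargs(prompt, image_path))
    return response["message"]["content"]


kwargs = build_chat_kwargs("Describe this image", "./image.jpg")
print(kwargs["options"])
```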

Benchmarks

Video-MME: 33.7
MLVU: 40.6
MVBench: 32.7

License

Apache 2.0

Summary Comparison Table

Spec	256M Model	500M Model
Parameters	256M	500M
VRAM	~1.38GB	~1.8GB
Q8 Size	~175MB	~546MB
FP16 Size	~328MB	~1GB
Context	2048 tokens	4096 tokens
Video-MME	33.7	42.2
MLVU	40.6	47.3
MVBench	32.7	39.73