SmolVLM2-2.2B-Instruct is a compact multimodal model for image and video understanding, built on SmolLM2-1.7B with a SigLIP vision encoder. It supports visual question answering, OCR, and video analysis, is available in Q8_0 and FP16 quantizations, and is released under the Apache 2.0 license.

SmolVLM2-2.2B-Instruct

A compact yet powerful vision-language model from Hugging Face.

Features

  • Image & Video Understanding: Describe images, answer visual questions, analyze documents (see the example after this list)
  • 2.2B Parameters: Efficient enough for edge deployment
  • Multiple Quantizations: Q8_0 and FP16 variants available
  • Apache 2.0: Fully open source
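
As a concrete example of the document-analysis use case, the call below asks the model to read the text out of a scanned page. It is a minimal sketch assuming a running local Ollama server and the ollama Python package; the file name invoice.png is a placeholder.

import ollama

# Ask the model to transcribe text from a scanned document (OCR-style prompt).
# 'invoice.png' is a placeholder path; substitute your own image.
response = ollama.chat(
    model='ahmadwaqar/smolvlm2-2.2b-instruct',
    messages=[{
        'role': 'user',
        'content': 'Transcribe all text visible in this document.',
        'images': ['invoice.png']
    }]
)
print(response['message']['content'])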

Available Variants

| Tag    | Quantization | Size   | Notes           |
|--------|--------------|--------|-----------------|
| latest | Q8_0         | ~2.4GB | Default         |
| q8     | Q8_0         | ~2.4GB | Same as latest  |
| fp16   | F16          | ~4.4GB | Full precision  |

Usage

For multimodal models, the Ollama CLI picks up image file paths included directly in the prompt:

# Default (Q8)
ollama run ahmadwaqar/smolvlm2-2.2b-instruct "Describe this image: ./photo.jpg"

# Explicit Q8
ollama run ahmadwaqar/smolvlm2-2.2b-instruct:q8 "Describe this image: ./photo.jpg"

# FP16 (higher quality)
ollama run ahmadwaqar/smolvlm2-2.2b-instruct:fp16 "Describe this image: ./photo.jpg"

API

from ollama import Client

client = Client(host='http://localhost:11434')
response = client.chat(
    model='ahmadwaqar/smolvlm2-2.2b-instruct',  # uses Q8 by default
    messages=[{
        'role': 'user',
        'content': 'What do you see?',
        'images': ['image.png']
    }]
)
print(response['message']['content'])
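
The client also accepts a specific tag, generation options, and streaming. The snippet below is a minimal sketch along those lines; the num_ctx value simply restates the model's default context length and is shown only to illustrate the options parameter.

from ollama import Client

client = Client(host='http://localhost:11434')
# Stream tokens from the FP16 variant as they are generated.
stream = client.chat(
    model='ahmadwaqar/smolvlm2-2.2b-instruct:fp16',
    messages=[{
        'role': 'user',
        'content': 'Describe this image in detail.',
        'images': ['image.png']
    }],
    options={'num_ctx': 8192},  # optional; matches the model's 8K context
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
print()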

Model Details

| Property     | Value                  |
|--------------|------------------------|
| Parameters   | 2.2B                   |
| Architecture | SmolLM2-1.7B + SigLIP  |
| Context      | 8K tokens              |
| Variants     | Q8_0 (default), FP16   |
| License      | Apache 2.0             |
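
These properties can also be read back programmatically once the model is pulled. A rough sketch with the same Python client; the exact fields in the response come from Ollama's show endpoint:

from ollama import Client

client = Client(host='http://localhost:11434')
# Fetch metadata for the locally installed model: family, parameter count,
# quantization level, and runtime parameters such as stop tokens.
info = client.show('ahmadwaqar/smolvlm2-2.2b-instruct')
print(info['details'])
print(info['parameters'])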

Links