73 Downloads Updated 1 week ago
18a7243e13d4 · 1.0GB
SmolVLM2-2.2B-Instruct is a lightweight yet capable vision-language model from HuggingFace, designed for image understanding, document reading, video-frame analysis, and multimodal reasoning. At just 2.2 billion parameters, it runs efficiently on consumer hardware, including laptops and smartphones, while still delivering strong performance on vision tasks, making advanced vision AI accessible to everyone.
| Tag | Size | RAM Required | Description |
|---|---|---|---|
| q4_k_m | 1.0 GB | ~4GB | Recommended - best quality/size ratio |
| q8_0 | 1.8 GB | ~6GB | Higher quality, minimal loss |
| f16 | 3.4 GB | ~8GB | Full precision, maximum quality |
```shell
# Recommended version (Q4_K_M, the default tag)
ollama run richardyoung/smolvlm2-2.2b-instruct "Describe this image"

# Higher quality version
ollama run richardyoung/smolvlm2-2.2b-instruct:q8_0 "What text is in this document?"

# Full precision
ollama run richardyoung/smolvlm2-2.2b-instruct:f16 "Analyze this chart"

# Example prompts
ollama run richardyoung/smolvlm2-2.2b-instruct "Describe what you see in detail"
ollama run richardyoung/smolvlm2-2.2b-instruct "Extract all text from this document"
ollama run richardyoung/smolvlm2-2.2b-instruct "How many people are in this photo?"
ollama run richardyoung/smolvlm2-2.2b-instruct "What is happening in these video frames?"
```
Apache 2.0 - Free for commercial and personal use.
Note: For vision tasks, use with an Ollama client that supports image input (e.g., Open WebUI, Ollama API with base64 images).
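As a minimal sketch of the API route, the snippet below builds a JSON request body for Ollama's `/api/generate` endpoint, which accepts an `images` list of base64-encoded strings alongside the text `prompt`. The helper function name and the placeholder image bytes are illustrative, not part of this model's distribution:

```python
import base64
import json

def build_generate_request(image_bytes: bytes, prompt: str,
                           model: str = "richardyoung/smolvlm2-2.2b-instruct") -> str:
    """Build a JSON payload for Ollama's /api/generate endpoint.

    Images are passed as base64-encoded strings in the `images` list,
    next to the plain-text `prompt`.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return json.dumps(payload)

# Placeholder bytes stand in for a real image file read with open(path, "rb").
request_body = build_generate_request(b"<raw image bytes>", "Describe this image")
```

POST the resulting body to `http://localhost:11434/api/generate` (the default Ollama address) with any HTTP client.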