SmolVLM2-2.2B-Instruct: Ultra-Compact Vision-Language Model

SmolVLM2-2.2B-Instruct is a lightweight yet powerful vision-language model that can understand images, read documents, and analyze video frames. At just 2.2B parameters, it runs efficiently on consumer hardware including laptops and smartphones, making advanced vision AI accessible to everyone.

🚀 Overview

SmolVLM2-2.2B-Instruct is a highly efficient 2.2 billion parameter vision-language model from HuggingFace, designed for image understanding, video analysis, and multimodal reasoning. Despite its compact size, it delivers impressive performance on vision tasks while running on consumer hardware.

🎯 Key Features

  • Ultra-compact - Only 2.2B parameters, runs on laptops and mobile devices
  • Vision & Video - Understands images, analyzes video frames, reads documents
  • Instruction-tuned - Optimized for following natural language instructions
  • Fast inference - the Q4_K_M quantization reaches roughly 30 tokens/sec or more on Apple M-series Macs
  • Apache 2.0 - Fully open source, no restrictions

📊 Capabilities

  • Image Understanding: Describe, analyze, and answer questions about images
  • Document OCR: Extract text and understand document layouts
  • Video Analysis: Process video frames for temporal understanding
  • Visual Reasoning: Solve problems requiring visual comprehension
  • Chart/Graph Reading: Interpret data visualizations

🏷️ Available Versions

Tag      Size     RAM Required   Description
q4_k_m   1.0 GB   ~4 GB          Recommended - best quality/size ratio
q8_0     1.8 GB   ~6 GB          Higher quality, minimal loss
f16      3.4 GB   ~8 GB          Full precision, maximum quality
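
To download a particular quantization ahead of time (for example, before going offline), you can pull its tag explicitly. A minimal sketch using the tags from the table above:

# Pre-download a specific quantization without starting a chat
ollama pull richardyoung/smolvlm2-2.2b-instruct:q8_0

# Confirm the model is available locally and check its size on disk
ollama list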

💻 Quick Start

# Recommended version (Q4_K_M)
ollama run richardyoung/smolvlm2-2.2b-instruct "Describe this image"

# Higher quality version
ollama run richardyoung/smolvlm2-2.2b-instruct:q8_0 "What text is in this document?"

# Full precision
ollama run richardyoung/smolvlm2-2.2b-instruct:f16 "Analyze this chart"
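
The prompts above assume an image has been supplied. With the stock Ollama CLI you can attach a local image by including its file path in the prompt; a minimal sketch, where ./photo.jpg is a placeholder filename and the assumption is that this GGUF build includes the vision projector:

# Attach a local image by putting its path in the prompt (./photo.jpg is a placeholder)
ollama run richardyoung/smolvlm2-2.2b-instruct "Describe this image: ./photo.jpg"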

🛠️ Example Use Cases

Image Description

ollama run richardyoung/smolvlm2-2.2b-instruct "Describe what you see in detail"

Document Analysis

ollama run richardyoung/smolvlm2-2.2b-instruct "Extract all text from this document"

Visual Q&A

ollama run richardyoung/smolvlm2-2.2b-instruct "How many people are in this photo?"

Video Understanding

ollama run richardyoung/smolvlm2-2.2b-instruct "What is happening in these video frames?"

📋 System Requirements

Minimum Requirements

  • RAM: 4GB
  • CPU: Any modern x86_64 or ARM64
  • Storage: 2GB free space

Recommended Setup

  • RAM: 8GB+
  • Device: Apple Silicon Mac, modern laptop, or smartphone
  • Storage: 5GB free space (for all quantizations)

🌟 What Makes This Model Special

  1. Tiny Footprint: Runs on devices where larger VLMs cannot
  2. Video Support: Native understanding of video frame sequences
  3. Efficient Architecture: Optimized for edge deployment
  4. Multilingual: Supports multiple languages for vision tasks
  5. Production Ready: Battle-tested by the HuggingFace team

🤝 Credits

  • Original Model: HuggingFace Team
  • GGUF Conversion: Richard Young (deepneuro.ai)
  • Quantization: llama.cpp

📝 License

Apache 2.0 - Free for commercial and personal use.


Note: For vision tasks, use with an Ollama client that supports image input (e.g., Open WebUI, Ollama API with base64 images).
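
For reference, a minimal sketch of the base64 route mentioned in the note above, using Ollama's /api/generate endpoint; photo.jpg is a placeholder path:

# Base64-encode a local image (GNU base64; on macOS use: base64 -i photo.jpg)
IMG=$(base64 -w0 photo.jpg)

# Send the prompt and the encoded image to a locally running Ollama server
curl http://localhost:11434/api/generate -d "{
  \"model\": \"richardyoung/smolvlm2-2.2b-instruct\",
  \"prompt\": \"Describe this image in detail\",
  \"images\": [\"$IMG\"],
  \"stream\": false
}"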