SmolVLM2-2.2B-Instruct: Ultra-Compact Vision-Language Model

SmolVLM2-2.2B-Instruct is a lightweight yet powerful vision-language model that can understand images, read documents, and analyze video frames. At just 2.2B parameters, it runs efficiently on consumer hardware including laptops and smartphones, making advanced vision AI accessible to everyone.

🚀 Overview

SmolVLM2-2.2B-Instruct is a highly efficient 2.2 billion parameter vision-language model from HuggingFace, designed for image understanding, video analysis, and multimodal reasoning. Despite its compact size, it delivers impressive performance on vision tasks while running on consumer hardware.

🎯 Key Features

  • Ultra-compact - Only 2.2B parameters, runs on laptops and mobile devices
  • Vision & Video - Understands images, analyzes video frames, reads documents
  • Instruction-tuned - Optimized for following natural language instructions
  • Fast inference - Q4_K_M runs at roughly 30+ tokens/sec on M-series Macs
  • Apache 2.0 - fully open source, free for commercial and personal use

📊 Capabilities

  • Image Understanding: Describe, analyze, and answer questions about images
  • Document OCR: Extract text and understand document layouts
  • Video Analysis: Process video frames for temporal understanding
  • Visual Reasoning: Solve problems requiring visual comprehension
  • Chart/Graph Reading: Interpret data visualizations

๐Ÿท๏ธ Available Versions

Tag      Size     RAM Required   Description
q4_k_m   1.0 GB   ~4 GB          Recommended - best quality/size ratio
q8_0     1.8 GB   ~6 GB          Higher quality, minimal loss
f16      3.4 GB   ~8 GB          Full precision, maximum quality

💻 Quick Start

# Recommended version (Q4_K_M)
ollama run richardyoung/smolvlm2-2.2b-instruct "Describe this image"

# Higher quality version
ollama run richardyoung/smolvlm2-2.2b-instruct:q8_0 "What text is in this document?"

# Full precision
ollama run richardyoung/smolvlm2-2.2b-instruct:f16 "Analyze this chart"
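The model's published parameters set an 8K context window and ChatML-style stop tokens, so multi-turn and long-document prompts work out of the box:

```json
{
  "num_ctx": 8192,
  "stop": ["<|im_end|>", "<|endoftext|>"]
}
```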

๐Ÿ› ๏ธ Example Use Cases

Image Description

ollama run richardyoung/smolvlm2-2.2b-instruct "Describe what you see in detail"

Document Analysis

ollama run richardyoung/smolvlm2-2.2b-instruct "Extract all text from this document"

Visual Q&A

ollama run richardyoung/smolvlm2-2.2b-instruct "How many people are in this photo?"

Video Understanding

ollama run richardyoung/smolvlm2-2.2b-instruct "What is happening in these video frames?"

📋 System Requirements

Minimum Requirements

  • RAM: 4GB
  • CPU: Any modern x86_64 or ARM64
  • Storage: 2GB free space

Recommended Setup

  • RAM: 8GB+
  • Device: Apple Silicon Mac, modern laptop, or smartphone
  • Storage: 5GB free space (for all quantizations)

🌟 What Makes This Model Special

  1. Tiny Footprint: Runs on devices where larger VLMs cannot
  2. Video Support: Native understanding of video frame sequences
  3. Efficient Architecture: Optimized for edge deployment
  4. Multilingual: Supports multiple languages for vision tasks
  5. Production Ready: Battle-tested by HuggingFace team

๐Ÿค Credits

  • Original Model: HuggingFace Team
  • GGUF Conversion: Richard Young (deepneuro.ai)
  • Quantization: llama.cpp

๐Ÿ“ License

Apache 2.0 - Free for commercial and personal use.


Note: For vision tasks, use with an Ollama client that supports image input (e.g., Open WebUI, Ollama API with base64 images).
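As a minimal sketch of the API route, the snippet below builds a request body for Ollama's `/api/generate` endpoint, whose `images` field carries base64-encoded image data alongside the text prompt. The helper name `build_vision_request` and the placeholder bytes are illustrative, not part of Ollama itself; the endpoint URL assumes a default local install.

```python
import base64
import json

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint
MODEL = "richardyoung/smolvlm2-2.2b-instruct"


def build_vision_request(prompt: str, image_bytes: bytes, model: str = MODEL) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint.

    The "images" field takes a list of base64-encoded images, which is
    how the Ollama API accepts vision input for multimodal models.
    """
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


# Placeholder bytes stand in for a real photo, e.g. open("photo.jpg", "rb").read():
payload = build_vision_request("Describe this image", b"<raw image bytes>")
print(json.dumps(payload)[:60])

# Sending it requires a running Ollama server:
# import urllib.request
# req = urllib.request.Request(
#     OLLAMA_URL,
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(json.loads(urllib.request.urlopen(req).read())["response"])
```

The same payload shape works with `/api/chat` by moving `prompt` and `images` into a user message.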