
Lightweight 2.2B vision model for GUI automation: it clicks, types, and scrolls on screenshots. Fine-tuned on the aguvis datasets for agentic reasoning. Available in Q8 and FP16 quantizations. Apache 2.0 license.


SmolVLM2-2.2B-Instruct-Agentic-GUI

A lightweight vision-language model fine-tuned for GUI automation and agentic tasks. This model can understand screenshots, locate UI elements, and execute multi-step interactions on desktop and mobile interfaces.

Available Variants

| Tag    | Quantization | Size   | Notes          |
|--------|--------------|--------|----------------|
| latest | Q8_0         | ~2.5GB | Default        |
| q8     | Q8_0         | ~2.5GB | Same as latest |
| fp16   | F16          | ~4.4GB | Full precision |

Model Details

| Property       | Value                                |
|----------------|--------------------------------------|
| Base Model     | HuggingFaceTB/SmolVLM2-2.2B-Instruct |
| Parameters     | 2.2B                                 |
| Variants       | Q8_0 (default), FP16                 |
| Context Length | 8192                                 |
| Vision Support | Yes                                  |
| License        | Apache 2.0                           |

Capabilities

  • GUI Grounding: Locate and identify UI elements from screenshots
  • Agentic Reasoning: Plan and execute multi-step GUI interactions
  • Action Generation: Generate precise click, type, scroll, and drag actions
  • Cross-Platform: Works with desktop, mobile, and web interfaces

Usage

```shell
# Default (Q8)
ollama run ahmadwaqar/smolvlm2-agentic-gui "Click on the search button" --images ./screenshot.png

# FP16 (higher quality)
ollama run ahmadwaqar/smolvlm2-agentic-gui:fp16 "Click on the search button" --images ./screenshot.png
```

API Usage

```python
import ollama

response = ollama.chat(
    model='ahmadwaqar/smolvlm2-agentic-gui',  # or :fp16
    messages=[{
        'role': 'user',
        'content': 'Click on the search button',
        'images': ['./screenshot.png']
    }]
)
print(response['message']['content'])
```

Training

This model was trained using a two-phase approach from Smol2Operator:

  1. Phase 1 - Grounding: Trained on smolagents/aguvis-stage-1 to learn UI element localization
  2. Phase 2 - Agentic Reasoning: Fine-tuned on smolagents/aguvis-stage-2 for multi-step task planning

Performance

| Benchmark     | Score  |
|---------------|--------|
| ScreenSpot-v2 | 61.71% |

Action Space

The model outputs actions in a unified format:

  • click(x, y) - Click at normalized coordinates [0,1]
  • type(text) - Type text input
  • scroll(direction) - Scroll up/down/left/right
  • drag(x1, y1, x2, y2) - Drag from one point to another
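The action strings above can be turned into structured calls with a small parser. This is a hypothetical sketch based only on the four action shapes listed in this card; the model's exact output syntax may differ, so validate against real outputs before relying on it.

```python
import re

# Matches the unified action format listed above: click, type, scroll, drag.
ACTION_RE = re.compile(r"^(click|type|scroll|drag)\((.*)\)$")

def parse_action(text: str):
    """Parse a model action string into (name, args)."""
    m = ACTION_RE.match(text.strip())
    if not m:
        raise ValueError(f"unrecognized action: {text!r}")
    name, raw_args = m.group(1), m.group(2)
    if name == "type":
        # type(text) carries a single (possibly quoted) string argument
        return name, [raw_args.strip().strip("'\"")]
    if name == "scroll":
        # scroll(direction) carries one of up/down/left/right
        return name, [raw_args.strip()]
    # click/drag carry comma-separated normalized coordinates in [0, 1]
    return name, [float(v) for v in raw_args.split(",")]
```

For example, `parse_action("click(0.42, 0.77)")` yields `("click", [0.42, 0.77])`, ready to dispatch to an automation backend.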

Example Prompts

"Click on the 'Submit' button"
"Type 'hello world' in the search field"
"What is the price shown on this page?"
"Navigate to the settings menu"
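Since the model emits normalized coordinates in [0, 1], executing a click requires mapping them to screen pixels first. A minimal sketch, assuming you query the real screen size at runtime (the pyautogui usage in the comment is one common option, not something this model card prescribes):

```python
def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Map normalized [0, 1] coordinates to integer pixel positions."""
    return round(x * (width - 1)), round(y * (height - 1))

# e.g. with an automation library such as pyautogui (not imported here):
#   px, py = to_pixels(0.42, 0.77, *pyautogui.size())
#   pyautogui.click(px, py)
```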

Citation

```bibtex
@misc{smol2operator2025,
  title={Smol2Operator: Post-Training GUI Agents for Computer Use},
  author={Hugging Face Team},
  year={2025},
  url={https://huggingface.co/blog/smol2operator}
}
```

License

Apache 2.0