
Lightweight 2.2B vision model for GUI automation: it clicks, types, and scrolls on screenshots. Fine-tuned on the aguvis datasets for agentic reasoning. Available in Q8 and FP16 quantizations. Apache 2.0 license.


SmolVLM2-2.2B-Instruct-Agentic-GUI

A lightweight vision-language model fine-tuned for GUI automation and agentic tasks. This model can understand screenshots, locate UI elements, and execute multi-step interactions on desktop and mobile interfaces.

Available Variants

| Tag    | Quantization | Size   | Notes          |
|--------|--------------|--------|----------------|
| latest | Q8_0         | ~2.5GB | Default        |
| q8     | Q8_0         | ~2.5GB | Same as latest |
| fp16   | F16          | ~4.4GB | Full precision |

Model Details

| Property       | Value                                |
|----------------|--------------------------------------|
| Base Model     | HuggingFaceTB/SmolVLM2-2.2B-Instruct |
| Parameters     | 2.2B                                 |
| Variants       | Q8_0 (default), FP16                 |
| Context Length | 8192                                 |
| Vision Support | Yes                                  |
| License        | Apache 2.0                           |

Capabilities

  • GUI Grounding: Locate and identify UI elements from screenshots
  • Agentic Reasoning: Plan and execute multi-step GUI interactions
  • Action Generation: Generate precise click, type, scroll, and drag actions
  • Cross-Platform: Works with desktop, mobile, and web interfaces

Usage

```shell
# Default (Q8)
ollama run ahmadwaqar/smolvlm2-agentic-gui "Click on the search button" --images ./screenshot.png

# FP16 (higher quality)
ollama run ahmadwaqar/smolvlm2-agentic-gui:fp16 "Click on the search button" --images ./screenshot.png
```

API Usage

```python
import ollama

response = ollama.chat(
    model='ahmadwaqar/smolvlm2-agentic-gui',  # or :fp16
    messages=[{
        'role': 'user',
        'content': 'Click on the search button',
        'images': ['./screenshot.png']
    }]
)
print(response['message']['content'])
```

Training

This model was trained using a two-phase approach from Smol2Operator:

  1. Phase 1 - Grounding: Trained on smolagents/aguvis-stage-1 to learn UI element localization
  2. Phase 2 - Agentic Reasoning: Fine-tuned on smolagents/aguvis-stage-2 for multi-step task planning

Performance

| Benchmark     | Score  |
|---------------|--------|
| ScreenSpot-v2 | 61.71% |

Action Space

The model outputs actions in a unified format:

  • click(x, y) - Click at normalized coordinates [0,1]
  • type(text) - Type text input
  • scroll(direction) - Scroll up/down/left/right
  • drag(x1, y1, x2, y2) - Drag from one point to another
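The action strings above can be turned into structured calls with a small parser. This is a hypothetical sketch based only on the four action shapes listed in this card; the model's exact output syntax may differ, so validate against real outputs before relying on it.

```python
import re

# Matches the unified action format listed above: click, type, scroll, drag.
ACTION_RE = re.compile(r"^(click|type|scroll|drag)\((.*)\)$")

def parse_action(text: str):
    """Parse a model action string into (name, args)."""
    m = ACTION_RE.match(text.strip())
    if not m:
        raise ValueError(f"unrecognized action: {text!r}")
    name, raw_args = m.group(1), m.group(2)
    if name == "type":
        # type(text) carries a single (possibly quoted) string argument
        return name, [raw_args.strip().strip("'\"")]
    if name == "scroll":
        # scroll(direction) carries one of up/down/left/right
        return name, [raw_args.strip()]
    # click/drag carry comma-separated normalized coordinates in [0, 1]
    return name, [float(v) for v in raw_args.split(",")]
```

For example, `parse_action("click(0.42, 0.77)")` yields `("click", [0.42, 0.77])`, ready to dispatch to an automation backend.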

Example Prompts

"Click on the 'Submit' button"
"Type 'hello world' in the search field"
"What is the price shown on this page?"
"Navigate to the settings menu"
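Since the model emits normalized coordinates in [0, 1], executing a click requires mapping them to screen pixels first. A minimal sketch, assuming you query the real screen size at runtime (the pyautogui usage in the comment is one common option, not something this model card prescribes):

```python
def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Map normalized [0, 1] coordinates to integer pixel positions."""
    return round(x * (width - 1)), round(y * (height - 1))

# e.g. with an automation library such as pyautogui (not imported here):
#   px, py = to_pixels(0.42, 0.77, *pyautogui.size())
#   pyautogui.click(px, py)
```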

Citation

```bibtex
@misc{smol2operator2025,
  title={Smol2Operator: Post-Training GUI Agents for Computer Use},
  author={Hugging Face Team},
  year={2025},
  url={https://huggingface.co/blog/smol2operator}
}
```

License

Apache 2.0