A lightweight vision-language model fine-tuned for GUI automation and agentic tasks. This model can understand screenshots, locate UI elements, and execute multi-step interactions on desktop and mobile interfaces.
| Tag | Quantization | Size | Notes |
|---|---|---|---|
| latest | Q8_0 | ~2.5GB | Default |
| q8 | Q8_0 | ~2.5GB | Same as latest |
| fp16 | F16 | ~4.4GB | Full precision |
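Tags are appended to the model name after a colon when pulling or running. As a minimal sketch, the `ollama` Python client can download a specific variant ahead of time (the CLI equivalent is `ollama pull`):

```python
import ollama

# Download the FP16 variant explicitly; omit the tag (or use :latest) for the Q8_0 default.
ollama.pull('ahmadwaqar/smolvlm2-agentic-gui:fp16')
```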
| Property | Value |
|---|---|
| Base Model | HuggingFaceTB/SmolVLM2-2.2B-Instruct |
| Parameters | 2.2B |
| Variants | Q8_0 (default), FP16 |
| Context Length | 8192 |
| Vision Support | Yes |
| License | Apache 2.0 |
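Ollama runtimes often allocate a smaller context than a model's maximum unless asked, so long multi-step sessions may truncate history. A minimal sketch of requesting the full window through the `options` field of the `ollama` Python client (whether this is needed depends on your Ollama version, so treat it as an assumption):

```python
import ollama

# Ask for the full 8192-token window; see the chat example below for the message format.
response = ollama.chat(
    model='ahmadwaqar/smolvlm2-agentic-gui',
    messages=[{'role': 'user', 'content': 'List the visible UI elements', 'images': ['./screenshot.png']}],
    options={'num_ctx': 8192},
)
```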
```bash
# Default (Q8); the screenshot path is passed inline in the prompt
ollama run ahmadwaqar/smolvlm2-agentic-gui "Click on the search button ./screenshot.png"

# FP16 (higher quality)
ollama run ahmadwaqar/smolvlm2-agentic-gui:fp16 "Click on the search button ./screenshot.png"
```
```python
import ollama

response = ollama.chat(
    model='ahmadwaqar/smolvlm2-agentic-gui',  # or :fp16
    messages=[{
        'role': 'user',
        'content': 'Click on the search button',
        'images': ['./screenshot.png'],
    }],
)
print(response['message']['content'])
```
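For automation loops it is usually more convenient to pass the screenshot from memory than to save it to disk first. A sketch assuming Pillow's `ImageGrab` for capture and the Python client's acceptance of raw image bytes in `images` (both are assumptions about your environment; file paths as shown above always work):

```python
import io

import ollama
from PIL import ImageGrab  # assumes Pillow; ImageGrab works on Windows/macOS and some Linux setups

# Capture the current screen and encode it as PNG in memory.
screenshot = ImageGrab.grab()
buffer = io.BytesIO()
screenshot.save(buffer, format='PNG')

response = ollama.chat(
    model='ahmadwaqar/smolvlm2-agentic-gui',
    messages=[{
        'role': 'user',
        'content': 'Click on the search button',
        'images': [buffer.getvalue()],  # raw PNG bytes instead of a file path
    }],
)
print(response['message']['content'])
```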
This model was trained using the two-phase approach from Smol2Operator.
| Benchmark | Score |
|---|---|
| ScreenSpot-v2 | 61.71% |
The model outputs actions in a unified format:
- `click(x, y)` - Click at normalized coordinates in [0, 1]
- `type(text)` - Type text input
- `scroll(direction)` - Scroll up/down/left/right
- `drag(x1, y1, x2, y2)` - Drag from one point to another

A sketch of parsing and executing these actions follows the example prompts below.

Example prompts:

"Click on the 'Submit' button"
"Type 'hello world' in the search field"
"What is the price shown on this page?"
"Navigate to the settings menu"
```bibtex
@misc{smol2operator2025,
  title={Smol2Operator: Post-Training GUI Agents for Computer Use},
  author={Hugging Face Team},
  year={2025},
  url={https://huggingface.co/blog/smol2operator}
}
```
Apache 2.0