171 · 1 month ago

Lightweight 2.2B vision model for GUI automation - clicks, types, scrolls on screenshots. Fine-tuned for agentic reasoning with normalized [0,1] coordinate output. Available in Q4_K_M, Q8_0, and FP16 quantizations. Apache 2.0 license.

vision
ollama run ahmadwaqar/smolvlm2-agentic-gui:fp16

Details

1 month ago

9f1f083bdab7 · 4.5GB

llama · 1.81B · F16
clip · 434M · F16
<|im_start|>{{ if .System }}System: {{ .System }}<end_of_utterance> {{ end }}<|im_start|>User: {{ if
You are a helpful GUI agent. You'll be given a task and a screenshot of the screen. Complete the tas
Apache License 2.0 - https://www.apache.org/licenses/LICENSE-2.0
{ "num_ctx": 4096, "num_predict": 512, "stop": [ "<end_of_utterance>", "

Readme

SmolVLM2-2.2B-Instruct-Agentic-GUI

A lightweight vision-language model fine-tuned for GUI automation and agentic tasks. This model can understand screenshots, locate UI elements, and execute multi-step interactions on desktop and mobile interfaces.

Available Variants

Tag      Quantization  Size     Notes
latest   Q4_K_M        ~1.1GB   Default, best speed/quality tradeoff
q8_0     Q8_0          ~1.9GB   Higher precision
fp16     F16           ~3.6GB   Full precision

Model Details

Property        Value
Base Model      smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI
Parameters      2.2B
Variants        Q4_K_M (default), Q8_0, FP16
Context Length  4096
Vision Support  Yes (mmproj-f16 projector)
License         Apache 2.0

Capabilities

  • GUI Grounding: Locate and identify UI elements from screenshots
  • Agentic Reasoning: Plan and execute multi-step GUI interactions
  • Action Generation: Generate precise click, type, scroll, and drag actions
  • Cross-Platform: Works with desktop, mobile, and web interfaces

Usage

# Default (Q4_K_M); ollama run takes image paths inside the prompt
ollama run ahmadwaqar/smolvlm2-agentic-gui "Click on the search button ./screenshot.png"

# Q8_0 (higher precision)
ollama run ahmadwaqar/smolvlm2-agentic-gui:q8_0 "Click on the search button ./screenshot.png"

# FP16 (full precision)
ollama run ahmadwaqar/smolvlm2-agentic-gui:fp16 "Click on the search button ./screenshot.png"

API Usage

import ollama

response = ollama.chat(
    model='ahmadwaqar/smolvlm2-agentic-gui',  # or :q8_0 or :fp16
    messages=[{
        'role': 'user',
        'content': 'Click on the search button',
        'images': ['./screenshot.png']
    }]
)
print(response['message']['content'])

Training

This model was trained using a two-phase approach from Smol2Operator:

  1. Phase 1 - Grounding: Trained on smolagents/aguvis-stage-1 to learn UI element localization
  2. Phase 2 - Agentic Reasoning: Fine-tuned on smolagents/aguvis-stage-2 for multi-step task planning

Performance

Benchmark      Score
ScreenSpot-v2  61.71%

Action Space

The model outputs actions in normalized [0,1] coordinates:

  • click(x, y) - Click at normalized coordinates
  • double_click(x, y) - Double-click at normalized coordinates
  • long_press(x, y) - Long press at normalized coordinates
  • type(text) - Type text input
  • press(keys) - Press keyboard key(s) (e.g. "enter", ["ctrl", "c"])
  • scroll(direction, amount) - Scroll up or down
  • drag(from_coord, to_coord) - Drag from [x1, y1] to [x2, y2]
  • navigate_back() - Go back to previous page
  • wait(seconds) - Wait for specified duration
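Because coordinates are normalized to [0,1], a client has to scale them to the actual screenshot resolution before dispatching input events. A minimal sketch of that step, assuming the model returns a bare action string (the helper names here are illustrative, not part of the model's output format):

```python
import re

def denormalize(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Map normalized [0, 1] coordinates to screen pixels."""
    return round(x * width), round(y * height)

def parse_click(action: str):
    """Extract (x, y) from a click(x, y) action string; None if it is not a click."""
    m = re.match(r"click\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)", action.strip())
    return (float(m.group(1)), float(m.group(2))) if m else None
```

For example, `click(0.42, 0.31)` on a 1920x1080 screenshot resolves to pixel (806, 335).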

Example Prompts

"Click on the 'Submit' button"
"Type 'hello world' in the search field"
"Scroll down to see more content"
"Navigate to the settings menu"
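The loop that turns these prompts into real input events is left to the client. A hedged sketch of an action dispatcher with an injectable backend follows; the class names and backend protocol are assumptions for illustration (in practice the backend would wrap an input library such as pyautogui rather than record calls):

```python
import re

class RecordingBackend:
    """Stub input backend that records calls instead of driving a real cursor."""
    def __init__(self):
        self.calls = []
    def click(self, px: int, py: int):
        self.calls.append(("click", px, py))
    def type(self, text: str):
        self.calls.append(("type", text))

class ActionExecutor:
    """Dispatches model-emitted action strings to an input backend."""
    def __init__(self, backend, width: int, height: int):
        self.backend = backend
        self.width, self.height = width, height

    def execute(self, action: str) -> bool:
        """Handle one action string; returns False for unrecognized actions."""
        action = action.strip()
        if m := re.match(r"click\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)$", action):
            x, y = float(m.group(1)), float(m.group(2))
            # Scale normalized [0,1] coordinates to screen pixels.
            self.backend.click(round(x * self.width), round(y * self.height))
            return True
        if m := re.match(r"type\(\s*['\"](.*)['\"]\s*\)$", action):
            self.backend.type(m.group(1))
            return True
        return False
```

Unhandled actions (drag, scroll, wait, ...) would be added the same way; returning False lets the caller decide whether to retry or abort the step.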

Citation

@misc{smol2operator2025,
  title={Smol2Operator: Post-Training GUI Agents for Computer Use},
  author={Hugging Face Team},
  year={2025},
  url={https://huggingface.co/blog/smol2operator}
}

License

Apache 2.0