171 · 1 month ago

Lightweight 2.2B vision model for GUI automation - clicks, types, scrolls on screenshots. Fine-tuned for agentic reasoning with normalized [0,1] coordinate output. Available in Q4_K_M, Q8_0, and FP16 quantizations. Apache 2.0 license.

vision
ollama run ahmadwaqar/smolvlm2-agentic-gui:fp16

Details

1 month ago

9f1f083bdab7 · 4.5GB

llama · 1.81B · F16
clip · 434M · F16
<|im_start|>{{ if .System }}System: {{ .System }}<end_of_utterance> {{ end }}<|im_start|>User: {{ if
You are a helpful GUI agent. You'll be given a task and a screenshot of the screen. Complete the tas
Apache License 2.0 - https://www.apache.org/licenses/LICENSE-2.0
{ "num_ctx": 4096, "num_predict": 512, "stop": [ "<end_of_utterance>", "

Readme

SmolVLM2-2.2B-Instruct-Agentic-GUI

A lightweight vision-language model fine-tuned for GUI automation and agentic tasks. This model can understand screenshots, locate UI elements, and execute multi-step interactions on desktop and mobile interfaces.

Available Variants

Tag      Quantization  Size     Notes
latest   Q4_K_M        ~1.1GB   Default, best speed/quality tradeoff
q8_0     Q8_0          ~1.9GB   Higher precision
fp16     F16           ~3.6GB   Full precision

Model Details

Property        Value
Base Model      smolagents/SmolVLM2-2.2B-Instruct-Agentic-GUI
Parameters      2.2B
Variants        Q4_K_M (default), Q8_0, FP16
Context Length  4096
Vision Support  Yes (mmproj-f16 projector)
License         Apache 2.0

Capabilities

  • GUI Grounding: Locate and identify UI elements from screenshots
  • Agentic Reasoning: Plan and execute multi-step GUI interactions
  • Action Generation: Generate precise click, type, scroll, and drag actions
  • Cross-Platform: Works with desktop, mobile, and web interfaces

Usage

# Default (Q4_K_M); ollama run takes image paths inside the prompt
ollama run ahmadwaqar/smolvlm2-agentic-gui "Click on the search button ./screenshot.png"

# Q8_0 (higher precision)
ollama run ahmadwaqar/smolvlm2-agentic-gui:q8_0 "Click on the search button ./screenshot.png"

# FP16 (full precision)
ollama run ahmadwaqar/smolvlm2-agentic-gui:fp16 "Click on the search button ./screenshot.png"

API Usage

import ollama

response = ollama.chat(
    model='ahmadwaqar/smolvlm2-agentic-gui',  # or :q8_0 or :fp16
    messages=[{
        'role': 'user',
        'content': 'Click on the search button',
        'images': ['./screenshot.png']
    }]
)
print(response['message']['content'])

Training

This model was trained using a two-phase approach from Smol2Operator:

  1. Phase 1 - Grounding: Trained on smolagents/aguvis-stage-1 to learn UI element localization
  2. Phase 2 - Agentic Reasoning: Fine-tuned on smolagents/aguvis-stage-2 for multi-step task planning

Performance

Benchmark      Score
ScreenSpot-v2  61.71%

Action Space

The model outputs actions in normalized [0,1] coordinates:

  • click(x, y) - Click at normalized coordinates
  • double_click(x, y) - Double-click at normalized coordinates
  • long_press(x, y) - Long press at normalized coordinates
  • type(text) - Type text input
  • press(keys) - Press keyboard key(s) (e.g. "enter", ["ctrl", "c"])
  • scroll(direction, amount) - Scroll up or down
  • drag(from_coord, to_coord) - Drag from [x1, y1] to [x2, y2]
  • navigate_back() - Go back to previous page
  • wait(seconds) - Wait for specified duration
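Because coordinates are normalized to [0,1], a client has to scale them to the actual screenshot resolution before dispatching input events. A minimal sketch of that step, assuming the model returns a bare action string (the helper names here are illustrative, not part of the model's output format):

```python
import re

def denormalize(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Map normalized [0, 1] coordinates to screen pixels."""
    return round(x * width), round(y * height)

def parse_click(action: str):
    """Extract (x, y) from a click(x, y) action string; None if it is not a click."""
    m = re.match(r"click\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)", action.strip())
    return (float(m.group(1)), float(m.group(2))) if m else None
```

For example, `click(0.42, 0.31)` on a 1920x1080 screenshot resolves to pixel (806, 335).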

Example Prompts

"Click on the 'Submit' button"
"Type 'hello world' in the search field"
"Scroll down to see more content"
"Navigate to the settings menu"
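The loop that turns these prompts into real input events is left to the client. A hedged sketch of an action dispatcher with an injectable backend follows; the class names and backend protocol are assumptions for illustration (in practice the backend would wrap an input library such as pyautogui rather than record calls):

```python
import re

class RecordingBackend:
    """Stub input backend that records calls instead of driving a real cursor."""
    def __init__(self):
        self.calls = []
    def click(self, px: int, py: int):
        self.calls.append(("click", px, py))
    def type(self, text: str):
        self.calls.append(("type", text))

class ActionExecutor:
    """Dispatches model-emitted action strings to an input backend."""
    def __init__(self, backend, width: int, height: int):
        self.backend = backend
        self.width, self.height = width, height

    def execute(self, action: str) -> bool:
        """Handle one action string; returns False for unrecognized actions."""
        action = action.strip()
        if m := re.match(r"click\(\s*([\d.]+)\s*,\s*([\d.]+)\s*\)$", action):
            x, y = float(m.group(1)), float(m.group(2))
            # Scale normalized [0,1] coordinates to screen pixels.
            self.backend.click(round(x * self.width), round(y * self.height))
            return True
        if m := re.match(r"type\(\s*['\"](.*)['\"]\s*\)$", action):
            self.backend.type(m.group(1))
            return True
        return False
```

Unhandled actions (drag, scroll, wait, ...) would be added the same way; returning False lets the caller decide whether to retry or abort the step.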

Citation

@misc{smol2operator2025,
  title={Smol2Operator: Post-Training GUI Agents for Computer Use},
  author={Hugging Face Team},
  year={2025},
  url={https://huggingface.co/blog/smol2operator}
}

License

Apache 2.0