187 1 month ago

GUI agent from Alibaba Tongyi Lab, built on Qwen3-VL. Reported SOTA results: 73.5% ScreenSpot-Pro, 76.7% AndroidWorld. Returns bounding boxes [x1, y1, x2, y2] for UI automation. Supports MCP tools and device-cloud collaboration. Apache 2.0. Tags: 2b (default), 8b.

vision 2b 8b
ollama run ahmadwaqar/mai-ui:2b

Details

cf0ee985da1d · 4.3GB · qwen3vl · 2.13B · F16
{ "repeat_penalty": 1.05, "stop": [ "<|im_start|>", "<|im_end|>", "<
{{- if .System }}<|im_start|>system {{ .System }}<|im_end|> {{ end }}{{- if .Prompt }}<|im_start|>us

Readme

MAI-UI

Foundation GUI agent by Alibaba Tongyi Lab built on Qwen3-VL for UI element detection and bounding box coordinate extraction.

Features

  • UI Grounding: Detects clickable elements, returns [x1, y1, x2, y2] bounding boxes
  • SOTA Performance: 73.5% ScreenSpot-Pro, 91.3% MMBench GUI L2, 76.7% AndroidWorld
  • MCP Tool Integration: Native support for external API calls
  • Device-Cloud Collaboration: 33% on-device boost, 40% fewer cloud calls
  • Apache 2.0: Fully open source

Available Tags

Tag      Parameters   Size     RAM Required
latest   2B           ~4GB     ~6GB
2b       2B           ~4GB     ~6GB
8b       8B           ~16GB    ~18GB

Usage

# Default (2B)
ollama run ahmadwaqar/mai-ui

# Explicit tags
ollama run ahmadwaqar/mai-ui:2b
ollama run ahmadwaqar/mai-ui:8b

Vision API

curl http://localhost:11434/api/chat -d '{
  "model": "ahmadwaqar/mai-ui",
  "stream": false,
  "format": "json",
  "messages": [{
    "role": "user",
    "content": "Identify all clickable UI elements with bounding boxes",
    "images": ["<BASE64_SCREENSHOT>"]
  }]
}'

Response

{
  "bbox_2d": [789, 402, 869, 437],
  "label": "forward chevron UI button"
}

Center: ((789+869)/2, (402+437)/2) = (829, 419.5), which rounds to (829, 420) as a click target
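The center arithmetic above can be wrapped in a small helper. This is a minimal sketch: the name `bbox_center` is hypothetical, and it assumes the box arrives as [x1, y1, x2, y2] in the same coordinate space as the screenshot, which this page does not state explicitly.

```python
from typing import Sequence, Tuple


def bbox_center(bbox: Sequence[float]) -> Tuple[float, float]:
    """Return the (x, y) midpoint of an [x1, y1, x2, y2] box."""
    if len(bbox) != 4:
        raise ValueError("expected [x1, y1, x2, y2]")
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)


# Values taken from the sample response.
cx, cy = bbox_center([789, 402, 869, 437])
print(cx, cy)  # 829.0 419.5
```

The helper returns floats and leaves rounding to the caller, since click APIs differ in whether they accept fractional coordinates.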

Python

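import base64

import ollama

# Read the screenshot and base64-encode it for the "images" field.
with open("screenshot.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

# format='json' asks the server to constrain the reply to valid JSON.
response = ollama.chat(
    model='ahmadwaqar/mai-ui',
    format='json',
    messages=[{
        'role': 'user',
        'content': 'Identify clickable elements with bounding boxes',
        'images': [img]
    }]
)

# The reply text is a JSON string in the shape shown under "Response".
print(response['message']['content'])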
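The printed content is a JSON string, so it still needs decoding before the coordinates can be used. The sketch below is one way to do that. The function name `parse_elements` is hypothetical, and the handling of a list of objects is an assumption: the sample response shows only a single object with `bbox_2d` and `label` keys.

```python
import json
from typing import Any, Dict, List


def parse_elements(content: str) -> List[Dict[str, Any]]:
    """Decode the model's JSON reply into a list of {bbox_2d, label} dicts.

    Accepts either a single object (as in the sample response) or a list
    of such objects (an assumption). Entries without a 4-number bbox_2d
    are skipped rather than raising.
    """
    data = json.loads(content)
    items = data if isinstance(data, list) else [data]
    elements = []
    for item in items:
        if not isinstance(item, dict):
            continue
        bbox = item.get("bbox_2d")
        if isinstance(bbox, list) and len(bbox) == 4:
            elements.append({"bbox_2d": bbox, "label": item.get("label", "")})
    return elements


sample = '{"bbox_2d": [789, 402, 869, 437], "label": "forward chevron UI button"}'
for element in parse_elements(sample):
    print(element["label"], element["bbox_2d"])
```

With the Python client, the string to pass in would be `response['message']['content']`.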

Model Details

Property         Value
Architecture     Qwen3-VL
Family           MAI-UI (2B / 8B / 32B / 235B-A22B)
Min Image Size   32x32 px
Output           JSON with bbox_2d coordinates
License          Apache 2.0
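Because the table lists a 32x32 px minimum image size, it may be worth checking a screenshot before sending it. The following is a minimal sketch for PNG files only, using the standard library; `png_size` and `meets_min_size` are hypothetical helper names, and how the model behaves on smaller inputs is not described here.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MIN_SIDE = 32  # minimum side length from the Model Details table


def png_size(data: bytes) -> Tuple[int, int]:
    """Read (width, height) from a PNG's IHDR chunk.

    Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type
    ("IHDR"), then width and height as big-endian 32-bit integers.
    """
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def meets_min_size(data: bytes, min_side: int = MIN_SIDE) -> bool:
    """True if both sides are at least min_side pixels."""
    width, height = png_size(data)
    return width >= min_side and height >= min_side
```

Only the first 24 bytes of the file are inspected, so the check can run on the same bytes that are later base64-encoded for the request.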

Benchmarks

Benchmark        Score
ScreenSpot-Pro   73.5%
MMBench GUI L2   91.3%
OSWorld-G        70.9%
UI-Vision        49.2%
AndroidWorld     76.7%
MobileWorld      41.7%

Use Cases

  • Mobile/desktop UI automation
  • Web and App testing (vision-based locators)
  • Accessibility detection
  • Screen parsing for AI agents
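As an illustration of the vision-based locator use case, the sketch below looks up an element by its label text and returns a point to click. It assumes the model output has already been decoded into dicts with the `bbox_2d` and `label` keys shown in the Response sample; the function name `locate` and the substring matching rule are hypothetical choices, not part of the model or of Ollama.

```python
from typing import Any, Dict, List, Optional, Tuple


def locate(elements: List[Dict[str, Any]], query: str) -> Optional[Tuple[float, float]]:
    """Return the center of the first element whose label contains query.

    Matching is a case-insensitive substring test. Returns None when
    nothing matches, so callers can retry or fail the test step.
    """
    needle = query.lower()
    for element in elements:
        if needle in str(element.get("label", "")).lower():
            x1, y1, x2, y2 = element["bbox_2d"]
            return ((x1 + x2) / 2, (y1 + y2) / 2)
    return None


elements = [
    {"bbox_2d": [789, 402, 869, 437], "label": "forward chevron UI button"},
]
print(locate(elements, "chevron"))  # (829.0, 419.5)
print(locate(elements, "settings"))  # None
```

The returned point could then be passed to whatever input driver the test harness uses. That driver is outside the scope of this sketch.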

Links