GUI agent from Alibaba Tongyi Lab, built on Qwen3-VL. Reported SOTA results: 73.5% on ScreenSpot-Pro, 76.7% on AndroidWorld. Returns bounding boxes as [x1, y1, x2, y2] for UI automation. Supports MCP tools and device-cloud collaboration. Apache 2.0. Tags: 2b (default), 8b.

Details

Updated 1 month ago

cf0ee985da1d · 4.3GB ·

model

arch	qwen3vl

parameters	2.13B

quantization	F16

4.3GB

params

{ "repeat_penalty": 1.05, "stop": [ "<|im_start|>", "<|im_end|>", "<

148B

template

{{- if .System }}<|im_start|>system {{ .System }}<|im_end|> {{ end }}{{- if .Prompt }}<|im_start|>us

184B

MAI-UI

Foundation GUI agent by Alibaba Tongyi Lab built on Qwen3-VL for UI element detection and bounding box coordinate extraction.

Features

UI Grounding: Detects clickable elements, returns [x1, y1, x2, y2] bounding boxes
SOTA Performance: 73.5% ScreenSpot-Pro, 91.3% MMBench GUI L2, 76.7% AndroidWorld
MCP Tool Integration: Native support for external API calls
Device-Cloud Collaboration: 33% improvement in on-device performance, 40% fewer cloud model calls
Apache 2.0: Fully open source

Available Tags

Tag	Parameters	Size	RAM Required
`latest`	2B	~4GB	~6GB
`2b`	2B	~4GB	~6GB
`8b`	8B	~16GB	~18GB
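
The table above can be turned into a simple tag picker. The following is a minimal sketch, not part of the model's tooling: `pick_tag` and `model_name` are hypothetical helpers, and the thresholds are the approximate "RAM Required" figures from the table rather than measured requirements.

```python
from typing import Optional

# Approximate RAM requirements in GB, copied from the table above
# (largest model first so the picker prefers it when it fits).
TAG_RAM_GB = [("8b", 18.0), ("2b", 6.0)]


def pick_tag(available_ram_gb: float) -> Optional[str]:
    """Return the largest tag whose approximate RAM need fits, or None."""
    for tag, needed in TAG_RAM_GB:
        if available_ram_gb >= needed:
            return tag
    return None


def model_name(tag: str, repo: str = "ahmadwaqar/mai-ui") -> str:
    """Build the name passed to `ollama run` or the API `model` field."""
    return f"{repo}:{tag}"


if __name__ == "__main__":
    tag = pick_tag(8.0)
    print(model_name(tag) if tag else "not enough RAM for any tag")
```

Because the table values are rounded estimates, treat a result near a threshold as a hint and leave some headroom for the OS and other processes.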

Usage

# Default (2B)
ollama run ahmadwaqar/mai-ui

# Explicit tags
ollama run ahmadwaqar/mai-ui:2b
ollama run ahmadwaqar/mai-ui:8b

Vision API

curl http://localhost:11434/api/chat -d '{
  "model": "ahmadwaqar/mai-ui",
  "stream": false,
  "format": "json",
  "messages": [{
    "role": "user",
    "content": "Identify all clickable UI elements with bounding boxes",
    "images": ["<BASE64_SCREENSHOT>"]
  }]
}'

Response

{
  "bbox_2d": [789, 402, 869, 437],
  "label": "forward chevron UI button"
}

Center: ((789+869)/2, (402+437)/2) = (829, 419.5), which rounds to (829, 420)
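
The center calculation above can be wrapped in a small helper. This is a minimal sketch: `bbox_center` and `click_point` are hypothetical names, and it assumes the box is already expressed in the coordinate space you intend to click in (the page does not state whether coordinates are absolute pixels or normalized, so rescale first if needed).

```python
import math
from typing import Sequence, Tuple


def bbox_center(bbox: Sequence[float]) -> Tuple[float, float]:
    """Return the exact (x, y) midpoint of an [x1, y1, x2, y2] box."""
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly four values")
    x1, y1, x2, y2 = bbox
    if x2 < x1 or y2 < y1:
        raise ValueError("bbox corners are out of order")
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def click_point(bbox: Sequence[float]) -> Tuple[int, int]:
    """Round the midpoint half-up to integer coordinates for a click."""
    cx, cy = bbox_center(bbox)
    return (math.floor(cx + 0.5), math.floor(cy + 0.5))


if __name__ == "__main__":
    box = [789, 402, 869, 437]
    print(bbox_center(box))  # (829.0, 419.5)
    print(click_point(box))  # (829, 420)
```

Half-up rounding is used instead of Python's built-in `round`, which rounds halves to the nearest even integer and can give surprising results for other boxes.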

Python

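import base64

import ollama

# Read the screenshot and base64-encode it for the `images` field.
with open("screenshot.png", "rb") as f:
    img = base64.b64encode(f.read()).decode()

response = ollama.chat(
    model='ahmadwaqar/mai-ui',
    format='json',  # ask Ollama to constrain the reply to valid JSON
    messages=[{
        'role': 'user',
        'content': 'Identify clickable elements with bounding boxes',
        'images': [img]
    }]
)
# The reply text is a JSON string containing `bbox_2d` and `label`.
print(response['message']['content'])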
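
The printed content is a JSON string, so it still needs to be decoded before use. The sketch below is one way to do that under stated assumptions: `parse_elements` is a hypothetical helper, and it assumes the model may return either a single object shaped like the sample response or a list of such objects. Entries without a four-number `bbox_2d` are skipped.

```python
import json
from typing import Any, Dict, List


def parse_elements(content: str) -> List[Dict[str, Any]]:
    """Decode model output into a list of {"bbox_2d", "label"} dicts."""
    data = json.loads(content)
    if isinstance(data, dict):
        data = [data]  # normalize a single object to a one-item list
    if not isinstance(data, list):
        return []

    elements = []
    for item in data:
        if not isinstance(item, dict):
            continue
        bbox = item.get("bbox_2d")
        is_valid = (
            isinstance(bbox, list)
            and len(bbox) == 4
            and all(isinstance(v, (int, float)) for v in bbox)
        )
        if is_valid:
            elements.append({"bbox_2d": bbox, "label": item.get("label", "")})
    return elements


if __name__ == "__main__":
    sample = '{"bbox_2d": [789, 402, 869, 437], "label": "forward chevron UI button"}'
    for element in parse_elements(sample):
        print(element["label"], element["bbox_2d"])
```

With the `ollama` client, the string to pass in would be `response['message']['content']`. `json.loads` raises `json.JSONDecodeError` on malformed output, which callers may want to catch and retry.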

Model Details

Property	Value
Architecture	Qwen3-VL
Family	MAI-UI (2B / 8B / 32B / 235B-A22B)
Min Image Size	32x32 px
Output	JSON with `bbox_2d` coordinates
License	Apache 2.0
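
Given the 32x32 px minimum listed above, it can help to check a screenshot before sending it. The following is a minimal sketch using only the standard library: `png_dimensions` and `meets_min_size` are hypothetical helpers, it handles PNG files only, and it assumes the minimum applies to both width and height.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MIN_SIDE_PX = 32  # minimum image size from the Model Details table


def png_dimensions(data: bytes) -> Tuple[int, int]:
    """Read (width, height) from the IHDR chunk at the start of a PNG."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("data does not start with a PNG header")
    # Width and height are big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def meets_min_size(data: bytes, min_side: int = MIN_SIDE_PX) -> bool:
    """Return True if both sides are at least `min_side` pixels."""
    width, height = png_dimensions(data)
    return width >= min_side and height >= min_side


if __name__ == "__main__":
    with open("screenshot.png", "rb") as f:
        raw = f.read()
    print(png_dimensions(raw), meets_min_size(raw))
```

For other formats such as JPEG, an imaging library would be a more practical way to read the dimensions.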

Benchmarks

Benchmark	Score
ScreenSpot-Pro	73.5%
MMBench GUI L2	91.3%
OSWorld-G	70.9%
UI-Vision	49.2%
AndroidWorld	76.7%
MobileWorld	41.7%

Use Cases

Mobile/desktop UI automation
Web and app testing (vision-based locators)
Accessibility detection
Screen parsing for AI agents