GUI-Owl

GUI-Owl is a multimodal vision-language model developed by mPLUG (Alibaba) as part of the Mobile-Agent-V3 project. It achieves state-of-the-art performance on GUI automation benchmarks including ScreenSpot-V2, ScreenSpot-Pro, OSWorld-G, MMBench-GUI, Android Control, Android World, and OSWorld.

Usage

ollama run ahmadwaqar/guiowl:7b-q8

ollama run ahmadwaqar/guiowl:32b-q8

Available Tags

Tag      Parameters   Quantization   Size
7b-q8    7B           Q8_0           ~8 GB
32b-q8   32B          Q8_0           ~34 GB
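The sizes above follow roughly from the Q8_0 layout used by these GGUF files: each block of 32 weights stores 32 int8 values plus one 16-bit scale, i.e. 34 bytes per 32 weights (8.5 bits per weight). A back-of-the-envelope sketch (ignoring metadata and any tensors kept at higher precision, so real file sizes differ slightly):

```python
# Rough Q8_0 size estimate: 34 bytes per block of 32 weights (8.5 bits/weight).
# Ignores GGUF metadata and non-quantized tensors, so actual files are a bit larger.
def q8_0_size_gb(n_params: float) -> float:
    return n_params * 34 / 32 / 1e9

print(round(q8_0_size_gb(7e9), 1))   # ≈ 7.4 GB for the 7B model
print(round(q8_0_size_gb(32e9)))     # ≈ 34 GB for the 32B model
```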

Capabilities

  • GUI element detection and grounding
  • Screen navigation and task automation
  • Desktop and mobile UI understanding
  • Visual question answering for UI components
  • End-to-end decision making for GUI tasks
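Beyond the interactive CLI, screenshots can be sent programmatically through Ollama's local REST API, which accepts base64-encoded images in the `images` field of a generate request. A minimal sketch, assuming an Ollama server on the default port (11434) and an illustrative local screenshot path:

```python
import base64
import json
import urllib.request

def build_request(model: str, prompt: str, image_path: str) -> dict:
    """Build the JSON payload for Ollama's POST /api/generate endpoint."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "prompt": prompt,
        "images": [image_b64],  # list of base64-encoded images
        "stream": False,        # return one complete JSON response
    }

def ask(payload: dict) -> str:
    """Send the payload to a locally running Ollama server."""
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires `ollama serve` running and a real screenshot file):
# payload = build_request("ahmadwaqar/guiowl:7b-q8",
#                         "What button should I click to submit this form?",
#                         "screenshot.png")
# print(ask(payload))
```

The file name and prompt here are placeholders; any screenshot and GUI question can be substituted.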

Example

ollama run ahmadwaqar/guiowl:7b-q8
>>> ./screenshot.png What button should I click to submit this form?

Images are attached by including a local file path in the prompt.

Model Details

Attribute       Value
Developer       mPLUG / Alibaba
Base Model      Qwen2.5-VL-7B-Instruct / Qwen2.5-VL-32B
GGUF Quant By   mradermacher
License         Apache 2.0 (7B) / Qwen License (32B)
Paper           arXiv:2508.15144
GitHub          X-PLUG/MobileAgent

Citation

@misc{ye2025mobileagentv3,
  title={Mobile-Agent-v3: Foundamental Agents for GUI Automation},
  author={Jiabo Ye and Xi Zhang and Haiyang Xu and others},
  year={2025},
  eprint={2508.15144},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}

Credits