I speak tensors, fluently!
-
gelab-zero-4b-preview
A 4B GUI agent for autonomous Android device control with zero-shot generalization.
vision · 311 Pulls · 1 Tag · Updated 2 months ago
-
smolvlm2-500m-video
Compact 500M vision-language model for video/image understanding. Supports visual QA, captioning, OCR, video analysis. Only 1.8GB VRAM. Built on SigLIP + SmolLM2. Available in Q8 and FP16. Apache 2.0 license.
vision · 217 Pulls · 3 Tags · Updated 1 month ago
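As a hedged illustration of the visual-QA workflow this entry describes, the sketch below queries the model through the official `ollama` Python client. The bare model tag and the image path are assumptions; substitute the tag shown for this entry, including any namespace prefix.

```python
# Minimal visual-QA sketch using the official `ollama` Python client.
# Assumptions: the model tag "smolvlm2-500m-video" and the local image
# path "frame.png" are placeholders, not confirmed by this page.
import ollama

response = ollama.chat(
    model="smolvlm2-500m-video",
    messages=[
        {
            "role": "user",
            "content": "Describe what happens in this image.",
            "images": ["frame.png"],  # local file path; raw bytes also work
        }
    ],
)
print(response["message"]["content"])
```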
-
mai-ui
Alibaba Tongyi GUI agent built on Qwen3-VL. SOTA: 73.5% on ScreenSpot-Pro, 76.7% on AndroidWorld. Returns bounding boxes as [x1,y1,x2,y2] for UI automation. Supports MCP tools and device-cloud collaboration. Apache 2.0 license. Tags: 2b (default), 8b.
vision · 2b · 8b · 198 Pulls · 3 Tags · Updated 1 month ago
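Because mai-ui returns a [x1,y1,x2,y2] bounding box, a grounding reply can be parsed straight into click coordinates. The sketch below uses the `ollama` Python client; the "mai-ui:2b" tag, the prompt wording, and the exact reply format are assumptions based only on the description above.

```python
# Hedged UI-grounding sketch: ask for a bounding box, then parse the
# first [x1,y1,x2,y2] quadruple out of the reply. The model tag and
# prompt phrasing below are assumptions, not a documented interface.
import re

import ollama

resp = ollama.chat(
    model="mai-ui:2b",
    messages=[
        {
            "role": "user",
            "content": "Locate the 'Settings' button and return its bounding box.",
            "images": ["screenshot.png"],  # placeholder path
        }
    ],
)
text = resp["message"]["content"]

# Extract the first [x1,y1,x2,y2] pattern from the model's text reply.
match = re.search(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", text)
if match:
    x1, y1, x2, y2 = map(int, match.groups())
    print(f"bbox: ({x1}, {y1}) -> ({x2}, {y2})")
```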
-
smolvlm2-2.2b-instruct
SmolVLM2-2.2B-Instruct is a compact multimodal model for image and video understanding. Built on SmolLM2-1.7B with a SigLIP vision encoder. Supports visual QA, OCR, and video analysis. Available in Q8 and FP16 quantizations. Apache 2.0 license.
vision · 195 Pulls · 2 Tags · Updated 2 months ago
-
gui-owl
GUI-Owl is a vision-language model by mPLUG/Alibaba for GUI understanding and automation. State-of-the-art on the ScreenSpot, OSWorld, and AndroidWorld benchmarks. It detects UI elements and automates tasks on desktop and mobile devices.
vision · 157 Pulls · 2 Tags · Updated 2 months ago
-
smolvlm2-256m-video
Ultra-compact 256M vision-language model for video/image understanding. Supports visual QA, captioning, OCR, video analysis. Only 1.38GB VRAM. Built on SigLIP + SmolLM2. Available in Q8 and FP16. Apache 2.0 license.
vision · 115 Pulls · 2 Tags · Updated 1 week ago
-
smolvlm2-agentic-gui
Lightweight 2.2B vision model for GUI automation: clicks, types, and scrolls on screenshots. Fine-tuned on the aguvis datasets for agentic reasoning. Available in Q8 and FP16 quantizations. Apache 2.0 license.
vision · 85 Pulls · 3 Tags · Updated 1 month ago
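A minimal single-step sketch of the agentic loop this entry implies, again via the `ollama` Python client. The model tag and the aguvis-style action syntax in the reply are assumptions, so inspect the raw output before wiring it to a real input driver.

```python
# One agent step: screenshot + task in, proposed action out.
# Assumptions: the model tag "smolvlm2-agentic-gui" and the shape of the
# action string (e.g. a click/type/scroll call) vary by fine-tune and are
# not confirmed by this page.
import ollama

resp = ollama.chat(
    model="smolvlm2-agentic-gui",
    messages=[
        {
            "role": "user",
            "content": "Task: open the search bar. What is the next action?",
            "images": ["screenshot.png"],  # placeholder path
        }
    ],
)
print(resp["message"]["content"])  # e.g. a click(x, y)-style action string
```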