I speak tensors, fluently!
-
gelab-zero-4b-preview
A 4B GUI agent for autonomous Android device control with zero-shot generalization.
vision · 311 Pulls · 1 Tag · Updated 2 months ago
-
smolvlm2-500m-video
Compact 500M vision-language model for video/image understanding. Supports visual QA, captioning, OCR, video analysis. Only 1.8GB VRAM. Built on SigLIP + SmolLM2. Available in Q8 and FP16. Apache 2.0 license.
vision · 217 Pulls · 3 Tags · Updated 1 month ago
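As a hedged illustration of the visual-QA workflow this entry describes, the sketch below queries the model through the official `ollama` Python client. The bare model tag and the image path are assumptions; substitute the tag shown for this entry, including any namespace prefix.

```python
# Minimal visual-QA sketch using the official `ollama` Python client.
# Assumptions: the model tag "smolvlm2-500m-video" and the local image
# path "frame.png" are placeholders, not confirmed by this page.
import ollama

response = ollama.chat(
    model="smolvlm2-500m-video",
    messages=[
        {
            "role": "user",
            "content": "Describe what happens in this image.",
            "images": ["frame.png"],  # local file path; raw bytes also work
        }
    ],
)
print(response["message"]["content"])
```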
-
mai-ui
Alibaba Tongyi GUI agent built on Qwen3-VL. SOTA: 73.5% on ScreenSpot-Pro, 76.7% on AndroidWorld. Returns bounding boxes as [x1,y1,x2,y2] for UI automation. Supports MCP tools and device-cloud collaboration. Apache 2.0 license. Tags: 2b (default), 8b.
vision · 2b · 8b · 198 Pulls · 3 Tags · Updated 1 month ago
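Because mai-ui returns a [x1,y1,x2,y2] bounding box, a grounding reply can be parsed straight into click coordinates. The sketch below uses the `ollama` Python client; the "mai-ui:2b" tag, the prompt wording, and the exact reply format are assumptions based only on the description above.

```python
# Hedged UI-grounding sketch: ask for a bounding box, then parse the
# first [x1,y1,x2,y2] quadruple out of the reply. The model tag and
# prompt phrasing below are assumptions, not a documented interface.
import re

import ollama

resp = ollama.chat(
    model="mai-ui:2b",
    messages=[
        {
            "role": "user",
            "content": "Locate the 'Settings' button and return its bounding box.",
            "images": ["screenshot.png"],  # placeholder path
        }
    ],
)
text = resp["message"]["content"]

# Extract the first [x1,y1,x2,y2] pattern from the model's text reply.
match = re.search(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", text)
if match:
    x1, y1, x2, y2 = map(int, match.groups())
    print(f"bbox: ({x1}, {y1}) -> ({x2}, {y2})")
```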
-
smolvlm2-2.2b-instruct
SmolVLM2-2.2B-Instruct is a compact multimodal model for image and video understanding. Built on SmolLM2-1.7B with a SigLIP vision encoder. Supports visual QA, OCR, and video analysis. Available in Q8 and FP16 quantizations. Apache 2.0 license.
vision · 195 Pulls · 2 Tags · Updated 2 months ago
-
gui-owl
GUI-Owl is a vision-language model by mPLUG/Alibaba for GUI understanding and automation. State-of-the-art on the ScreenSpot, OSWorld, and AndroidWorld benchmarks. It detects UI elements and automates tasks on desktop and mobile devices.
vision · 157 Pulls · 2 Tags · Updated 2 months ago
-
smolvlm2-256m-video
Ultra-compact 256M vision-language model for video/image understanding. Supports visual QA, captioning, OCR, video analysis. Only 1.38GB VRAM. Built on SigLIP + SmolLM2. Available in Q8 and FP16. Apache 2.0 license.
vision · 115 Pulls · 2 Tags · Updated 1 week ago
-
smolvlm2-agentic-gui
Lightweight 2.2B vision model for GUI automation: clicks, types, and scrolls on screenshots. Fine-tuned on the aguvis datasets for agentic reasoning. Available in Q8 and FP16 quantizations. Apache 2.0 license.
vision · 85 Pulls · 3 Tags · Updated 1 month ago
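A minimal single-step sketch of the agentic loop this entry implies, again via the `ollama` Python client. The model tag and the aguvis-style action syntax in the reply are assumptions, so inspect the raw output before wiring it to a real input driver.

```python
# One agent step: screenshot + task in, proposed action out.
# Assumptions: the model tag "smolvlm2-agentic-gui" and the shape of the
# action string (e.g. a click/type/scroll call) vary by fine-tune and are
# not confirmed by this page.
import ollama

resp = ollama.chat(
    model="smolvlm2-agentic-gui",
    messages=[
        {
            "role": "user",
            "content": "Task: open the search bar. What is the next action?",
            "images": ["screenshot.png"],  # placeholder path
        }
    ],
)
print(resp["message"]["content"])  # e.g. a click(x, y)-style action string
```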