- gelab-zero-4b-preview
  A 4B-parameter GUI agent for autonomous Android device control with zero-shot generalization.
  vision · 107 Pulls · 1 Tag · Updated 2 weeks ago
- smolvlm2-2.2b-instruct
  SmolVLM2-2.2B-Instruct is a compact multimodal model for image and video understanding. Built on SmolLM2-1.7B with a SigLIP vision encoder. Supports visual QA, OCR, and video analysis. Available in Q8 and FP16 quantizations. Apache 2.0 license.
  vision · 63 Pulls · 2 Tags · Updated 2 weeks ago
- smolvlm2-500m-video
  Compact 500M vision-language model for video and image understanding. Supports visual QA, captioning, OCR, and video analysis. Requires only 1.8 GB VRAM. Built on SigLIP + SmolLM2. Available in Q8 and FP16. Apache 2.0 license.
  vision · 53 Pulls · 2 Tags · Updated 2 weeks ago
- gui-owl
  GUI-Owl is a multimodal vision-language model by mPLUG/Alibaba for GUI understanding and automation. State-of-the-art on the ScreenSpot, OSWorld, and AndroidWorld benchmarks. Detects UI elements and automates tasks on desktop and mobile devices.
  vision · 47 Pulls · 2 Tags · Updated 2 weeks ago
- smolvlm2-agentic-gui
  Lightweight 2.2B vision model for GUI automation: clicks, types, and scrolls on screenshots. Fine-tuned on the aguvis datasets for agentic reasoning. Available in Q8 and FP16 quantizations. Apache 2.0 license.
  vision · 29 Pulls · 3 Tags · Updated 1 week ago
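These entries follow the Ollama library format, so once a model is pulled it can be queried with the official `ollama` Python client. The sketch below sends a screenshot to the GUI-automation model and asks for the next action; the model tag, image path, and prompt are illustrative assumptions, so check the model page for the exact tag and expected prompt format.

```python
# Minimal sketch, assuming the official `ollama` Python client
# (pip install ollama) and a locally pulled copy of the model.
import ollama

response = ollama.chat(
    model="smolvlm2-agentic-gui",  # assumed tag; confirm on the model page
    messages=[
        {
            "role": "user",
            "content": "What is the next UI action to open the Settings app?",
            "images": ["screenshot.png"],  # hypothetical local screenshot path
        }
    ],
)
print(response["message"]["content"])
```

The same call pattern applies to the other vision models listed above; only the model tag and the prompt change.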