```shell
ollama run youtu/youtu-vl
```
License • Code • Technical Report • Benchmarks • Getting Started
Youtu-VL is a lightweight yet robust Vision-Language Model (VLM) built on the Youtu-LLM with 4B parameters. It pioneers Vision-Language Unified Autoregressive Supervision (VLUAS), which markedly strengthens visual perception and multimodal understanding. This enables a standard VLM to perform vision-centric tasks without task-specific additions. Across benchmarks, Youtu-VL stands out for its versatility, achieving competitive results on both vision-centric and general multimodal tasks.
Comprehensive Vision-Centric Capabilities: The model demonstrates strong, broad proficiency across classic vision-centric tasks, delivering competitive performance in visual grounding, image classification, object detection, referring segmentation, semantic segmentation, depth estimation, object counting, and human pose estimation.
Promising Performance with High Efficiency: Despite its compact 4B-parameter architecture, the model achieves competitive results across a wide range of general multimodal tasks, including general visual question answering (VQA), multimodal reasoning and mathematics, optical character recognition (OCR), multi-image and real-world understanding, hallucination evaluation, and GUI agent tasks.
⚠️ Note: Dense prediction tasks (including Segmentation and Depth Estimation) are currently NOT supported in the Ollama version. For these capabilities, please refer to the original Transformers version: Youtu-VL-4B-Instruct.
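For the supported multimodal tasks (e.g. VQA or OCR), the model can be queried through Ollama's standard `/api/chat` endpoint, which accepts base64-encoded images alongside the prompt. The sketch below only builds the request payload; the question text and image bytes are illustrative placeholders, and the endpoint URL in the usage note assumes Ollama's default local port.

```python
import base64


def build_vqa_request(model: str, question: str, image_bytes: bytes) -> dict:
    """Build a JSON payload for Ollama's /api/chat endpoint with one image.

    Ollama expects images as base64-encoded strings in the message's
    "images" field, next to the plain-text prompt in "content".
    """
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": question,
                "images": [image_b64],
            }
        ],
        # Disable streaming so the endpoint returns a single JSON response
        "stream": False,
    }


# Placeholder bytes stand in for a real image file's contents
payload = build_vqa_request(
    "youtu/youtu-vl",
    "How many people are in this photo?",
    b"<raw image bytes>",
)
```

The resulting `payload` can then be sent as JSON via `POST` to `http://localhost:11434/api/chat` (Ollama's default local address).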