
Youtu-VL: a lightweight 4B VLM built on Youtu-LLM, pioneering VLUAS to improve visual perception and multimodal understanding. Not yet runnable: it requires Ollama with the latest llama.cpp changes integrated.

ollama run youtu/youtu-vl
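Once the required llama.cpp changes land in Ollama, the model can also be queried programmatically through Ollama's REST API (`POST /api/chat`), which accepts base64-encoded images in a message's `images` field. A minimal sketch of building such a request payload (the prompt and file path are illustrative):

```python
import base64


def build_chat_request(model: str, prompt: str, image_path: str) -> dict:
    """Build a JSON payload for Ollama's POST /api/chat endpoint.

    Ollama expects images as base64-encoded strings in the
    `images` list of a user message.
    """
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt, "images": [image_b64]}
        ],
        "stream": False,  # return a single response object instead of a stream
    }
```

The resulting dictionary can be sent as JSON to `http://localhost:11434/api/chat` with any HTTP client once the model is runnable.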

Readme

🎯 Introduction

Youtu-VL is a lightweight yet robust Vision-Language Model (VLM) with 4B parameters, built on Youtu-LLM. It pioneers Vision-Language Unified Autoregressive Supervision (VLUAS), which markedly strengthens visual perception and multimodal understanding, enabling a standard VLM to perform vision-centric tasks without task-specific additions. Across benchmarks, Youtu-VL stands out for its versatility, achieving competitive results on both vision-centric and general multimodal tasks.

✨ Key Features

  • Comprehensive Vision-Centric Capabilities: The model demonstrates strong, broad proficiency across classic vision-centric tasks, delivering competitive performance in visual grounding, image classification, object detection, referring segmentation, semantic segmentation, depth estimation, object counting, and human pose estimation.

  • Promising Performance with High Efficiency: Despite its compact 4B-parameter architecture, the model achieves competitive results across a wide range of general multimodal tasks, including general visual question answering (VQA), multimodal reasoning and mathematics, optical character recognition (OCR), multi-image and real-world understanding, hallucination evaluation, and GUI agent tasks.

โš ๏ธ Note: Dense prediction tasks (including Segmentation and Depth Estimation) are currently NOT supported in the Ollama version. For these capabilities, please refer to the original Transformers version: Youtu-VL-4B-Instruct.