youtu/youtu-vl:latest

133 downloads · updated 1 month ago

Youtu-VL: a lightweight 4B VLM built on Youtu-LLM, pioneering VLUAS to improve visual perception and multimodal understanding. Not yet runnable. Requires Ollama with the latest llama.cpp changes integrated.

vision
ollama run youtu/youtu-vl
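
Besides the interactive command above, a vision model served by Ollama can be queried over the local REST API by attaching a base64-encoded image to a chat message. The following is a minimal sketch, not an official client: it assumes an Ollama server on the default local port and, per the note above, an Ollama build that can actually load this model. The helper names `build_chat_request` and `send` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_request(image_bytes: bytes, prompt: str,
                       model: str = "youtu/youtu-vl") -> dict:
    """Build a non-streaming /api/chat payload with one attached image."""
    return {
        "model": model,
        "stream": False,
        "messages": [
            {
                "role": "user",
                "content": prompt,
                # Ollama expects each image as a base64-encoded string.
                "images": [base64.b64encode(image_bytes).decode("ascii")],
            }
        ],
    }


def send(payload: dict, url: str = OLLAMA_URL) -> str:
    """POST the payload and return the reply text (needs a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

To use it, read an image file in binary mode, pass its bytes and a question to `build_chat_request`, and hand the result to `send`.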

Details

639939534158 · 6.1GB
deepseek2 · 4.9B · Q8_0
clip · 445M · BF16
{{- range $i, $_ := .Messages }} {{- if and (eq $i 0) (ne .Role "system") }}<|begin_of_text|>system
You are a helpful assistant.
{ "num_ctx": 4096, "repeat_penalty": 1.05, "stop": [ "<|end_of_text|>" ],
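
The parameters above are the model's packaged defaults. Ollama's API also accepts an `options` object on a request, which can override such values for a single call. The sketch below copies only the three settings visible in the (truncated) listing above; the names `PAGE_DEFAULTS` and `with_options` are hypothetical, and any settings not shown on the page are left out rather than guessed.

```python
# Only the values visible in the truncated parameter listing are reproduced.
PAGE_DEFAULTS = {
    "num_ctx": 4096,
    "repeat_penalty": 1.05,
    "stop": ["<|end_of_text|>"],
}


def with_options(payload: dict, **overrides) -> dict:
    """Return a copy of payload with an "options" field.

    Values passed as keyword arguments take precedence over PAGE_DEFAULTS.
    The input payload is not modified.
    """
    options = {**PAGE_DEFAULTS, **overrides}
    return {**payload, "options": options}
```

For example, `with_options({"model": "youtu/youtu-vl"}, num_ctx=8192)` requests a larger context window for that call while keeping the other listed defaults.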

Readme

🎯 Introduction

Youtu-VL is a lightweight yet robust 4B-parameter Vision-Language Model (VLM) built on Youtu-LLM. It pioneers Vision-Language Unified Autoregressive Supervision (VLUAS), which markedly strengthens visual perception and multimodal understanding. This enables a standard VLM to perform vision-centric tasks without task-specific additions. Across benchmarks, Youtu-VL achieves competitive results on both vision-centric and general multimodal tasks.

✨ Key Features

  • Comprehensive Vision-Centric Capabilities: The model demonstrates strong, broad proficiency across classic vision-centric tasks, delivering competitive performance in visual grounding, image classification, object detection, referring segmentation, semantic segmentation, depth estimation, object counting, and human pose estimation.

  • Promising Performance with High Efficiency: Despite its compact 4B-parameter architecture, the model achieves competitive results across a wide range of general multimodal tasks, including general visual question answering (VQA), multimodal reasoning and mathematics, optical character recognition (OCR), multi-image and real-world understanding, hallucination evaluation, and GUI agent tasks.

⚠️ Note: Dense prediction tasks (including Segmentation and Depth Estimation) are currently NOT supported in the Ollama version. For these capabilities, please refer to the original Transformers version: Youtu-VL-4B-Instruct.