youtu/youtu-vl

youtu/youtu-vl:latest

133 Downloads · Updated 1 month ago

Youtu-VL: a lightweight 4B VLM built on Youtu-LLM, pioneering VLUAS to improve visual perception and multimodal understanding. Not yet runnable: it requires an Ollama build that integrates the latest llama.cpp changes.

vision

ollama run youtu/youtu-vl

curl http://localhost:11434/api/chat \
  -d '{
    "model": "youtu/youtu-vl",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

from ollama import chat

response = chat(
    model='youtu/youtu-vl',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'youtu/youtu-vl',
  messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)
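Since this is a vision model, the text-only requests above can be extended with an image. The following is a minimal sketch, assuming the standard Ollama `/api/chat` request shape in which a message carries an `images` list of base64-encoded files. The helper names `build_image_chat_payload` and `send_chat` are hypothetical, and the request only succeeds against a running Ollama build that can load this model.

```python
import base64
import json
import urllib.request

# Same endpoint as the curl example on this page.
OLLAMA_URL = "http://localhost:11434/api/chat"


def build_image_chat_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build an /api/chat request body with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [encoded]}],
        "stream": False,  # ask for one JSON response instead of a stream
    }


def send_chat(payload: dict, url: str = OLLAMA_URL) -> str:
    """POST the payload and return the reply text (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]
```

A call might look like `send_chat(build_image_chat_payload("youtu/youtu-vl", "Describe this image.", open("photo.png", "rb").read()))`, where `photo.png` is a placeholder file name.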

Details


639939534158 · 6.1GB

model

arch: deepseek2

parameters: 4.9B

quantization: Q8_0

5.2GB

projector

arch: clip

parameters: 445M

quantization: BF16

893MB
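The listed file sizes can be sanity-checked from the parameter counts and quantization types. The sketch below is a rough estimate only: it assumes decimal gigabytes, treats every weight as stored in the stated format (in practice some tensors may be kept at higher precision, and files carry metadata), and uses the rounded parameter counts shown above. The Q8_0 figure relies on the GGML block layout of 32 int8 values plus one 16-bit scale.

```python
# Q8_0 stores weights in blocks of 32: 32 int8 values plus one 16-bit scale,
# so 34 bytes per 32 weights (8.5 bits per weight).
Q8_0_BYTES_PER_WEIGHT = 34 / 32
# BF16 uses 16 bits per weight.
BF16_BYTES_PER_WEIGHT = 2


def estimate_gb(parameters: float, bytes_per_weight: float) -> float:
    """Approximate tensor storage in decimal gigabytes."""
    return parameters * bytes_per_weight / 1e9


model_gb = estimate_gb(4.9e9, Q8_0_BYTES_PER_WEIGHT)      # language model
projector_gb = estimate_gb(445e6, BF16_BYTES_PER_WEIGHT)  # vision projector
total_gb = model_gb + projector_gb

print(f"model ~{model_gb:.2f} GB, projector ~{projector_gb:.2f} GB, total ~{total_gb:.2f} GB")
```

Under these assumptions the estimates land close to the 5.2GB, 893MB, and 6.1GB figures listed on this page.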

template

{{- range $i, $_ := .Messages }} {{- if and (eq $i 0) (ne .Role "system") }}<|begin_of_text|>system

771B

system

You are a helpful assistant.

28B

params

{ "num_ctx": 4096, "repeat_penalty": 1.05, "stop": [ "<|end_of_text|>" ],

108B
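The parameters above are defaults baked into the model, and the Ollama chat API accepts an `options` object to override them per request. Here is a minimal sketch of building such a request body. It uses only the three values visible in the truncated params block, the helper name `build_payload_with_options` is hypothetical, and the `num_ctx=8192` override is purely illustrative (this page does not state which context lengths the model supports).

```python
import json

# Values visible in the (truncated) params block on this page.
PAGE_DEFAULTS = {
    "num_ctx": 4096,
    "repeat_penalty": 1.05,
    "stop": ["<|end_of_text|>"],
}


def build_payload_with_options(model: str, prompt: str, **overrides) -> dict:
    """Build an /api/chat body whose options start from the page defaults."""
    options = {**PAGE_DEFAULTS, **overrides}  # overrides win over defaults
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": options,
        "stream": False,
    }


# Illustrative override only; not a recommended setting.
payload = build_payload_with_options("youtu/youtu-vl", "Hello!", num_ctx=8192)
print(json.dumps(payload, indent=2))
```

The resulting JSON can be sent with the same `curl` command shown earlier by passing it as the `-d` body.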

Readme


#  <img src="/assets/youtu/youtu-vl/6abb9d33-6a27-44d9-b638-42b9a01824c1" alt="Youtu-VL Logo" height="100px">

[📃 License](LICENSE.txt) • [💻 Code](https://github.com/TencentCloudADP/youtu-vl) • [📑 Technical Report](https://arxiv.org/abs/2601.19798) • [📊 Benchmarks](#benchmarks) • [🚀 Getting Started](#quickstart)

## 🎯 Introduction

**Youtu-VL** is a lightweight yet robust 4B-parameter Vision-Language Model (VLM) built on Youtu-LLM. It pioneers Vision-Language Unified Autoregressive Supervision (VLUAS), which markedly strengthens visual perception and multimodal understanding. This enables a standard VLM to perform vision-centric tasks without task-specific additions. Across benchmarks, Youtu-VL stands out for its versatility, achieving competitive results on both vision-centric and general multimodal tasks.

## ✨ Key Features

- **Comprehensive Vision-Centric Capabilities**: The model demonstrates strong, broad proficiency across classic vision-centric tasks, delivering competitive performance in visual grounding, image classification, object detection, referring segmentation, semantic segmentation, depth estimation, object counting, and human pose estimation.

- **Promising Performance with High Efficiency**: Despite its compact 4B-parameter architecture, the model achieves competitive results across a wide range of general multimodal tasks, including general visual question answering (VQA), multimodal reasoning and mathematics, optical character recognition (OCR), multi-image and real-world understanding, hallucination evaluation, and GUI agent tasks.

⚠️ **Note:** Dense prediction tasks (including Segmentation and Depth Estimation) are currently **NOT** supported in the `Ollama` version. For these capabilities, please refer to the original Transformers version: [Youtu-VL-4B-Instruct](https://huggingface.co/tencent/Youtu-VL-4B-Instruct).
