
A GPT-4o-Level MLLM for Single-Image, Multi-Image, and High-FPS Video Understanding

vision
ollama run DedeProGames/AstralOCR-8b

Details

2 days ago

d60b490c2110 · 5.9GB

model       qwen3 · 8.19B · Q4_0
projector   clip · 527M · F16
template    {{- if .Messages }}{{- range $i, $_ := .Messages }}{{- $last := eq (len (slice $.Messages $i)) 1 -}}
system      You are a helpful assistant.
params      { "num_ctx": 4096, "stop": [ "[\"<|im_start|>\",\"<|im_end|>\"]" ], "tempera
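One thing worth noting about the parameters above: the `stop` value looks double-encoded. The list holds a single string that itself contains a JSON array, so the runtime would only stop on the literal text `["<|im_start|>","<|im_end|>"]` rather than on either token individually. A quick illustrative check in plain Python:

```python
import json

# The stop list as displayed in the parameters above: one element,
# which is itself a JSON array serialized into a string.
displayed_stop = ['["<|im_start|>","<|im_end|>"]']

# Decoding that single element recovers the two tokens that were
# presumably intended as separate stop sequences.
intended_stop = json.loads(displayed_stop[0])
print(intended_stop)  # ['<|im_start|>', '<|im_end|>']
```

If that reading is right, the intended configuration is a two-element stop list, one entry per token.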

Readme


AstralOCR is the latest and most capable OCR model in the Astral family. Built on SigLIP-400M and Qwen2-7B for a total of 8B parameters, it delivers major quality gains and adds new capabilities for multi-image and video understanding.

Notable features include:

  • Leading Performance: AstralOCR reaches an average score of 65.2 on the latest OpenCompass (an evaluation spanning 8 popular benchmarks). With only 8B parameters, it outperforms widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet on single-image understanding.

  • Multi-Image Understanding & In-Context Learning: AstralOCR supports conversation and reasoning over multiple images. It reports state-of-the-art results on multi-image benchmarks like Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and shows promising in-context learning behavior.

  • Strong OCR Capability: AstralOCR can handle images with any aspect ratio and up to 1.8 million pixels (e.g., 1344×1344). It reports state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. With RLAIF-V and VisCPM techniques, it aims for more trustworthy behavior (notably lower hallucination rates than GPT-4o/GPT-4V on Object HalBench) and supports multiple languages, including English, Chinese, German, French, Italian, Korean, and more.

  • Superior Efficiency: AstralOCR emphasizes high token density (more pixels per visual token). It produces only ~640 tokens for a 1.8M-pixel image—around 75% fewer than many alternatives—improving inference speed, first-token latency, memory usage, and power consumption.
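The multi-image conversations described above can be driven through Ollama's standard `/api/chat` endpoint, which accepts base64-encoded images on each message. A minimal sketch; the helper name and prompt here are illustrative, not part of the model card:

```python
import base64
import json

def build_chat_payload(model, prompt, images_bytes):
    """Build an Ollama /api/chat request body with inline base64 images."""
    images = [base64.b64encode(b).decode("ascii") for b in images_bytes]
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": images}],
        "stream": False,
    }

payload = build_chat_payload(
    "DedeProGames/AstralOCR-8b",
    "Compare the totals on these two receipts.",
    [b"<image-1 bytes>", b"<image-2 bytes>"],  # replace with real image file contents
)

# POST json.dumps(payload) to http://localhost:11434/api/chat to run the chat.
```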
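The efficiency claim is easy to sanity-check with the bullet's own numbers (illustrative arithmetic only; the baseline figure is implied by "~75% fewer", not stated on the card):

```python
pixels = 1344 * 1344              # ~1.8 million pixels, the card's example image
visual_tokens = 640               # tokens AstralOCR reports for such an image
density = pixels / visual_tokens  # pixels encoded per visual token

# "~75% fewer tokens" implies a baseline of roughly 4x as many tokens.
baseline_tokens = visual_tokens / (1 - 0.75)

print(round(density))        # 2822 pixels per visual token
print(int(baseline_tokens))  # 2560 tokens for a typical alternative
```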