
A GPT-4o-Level MLLM for Single-Image, Multi-Image, and High-FPS Video Understanding

vision
ollama run DedeProGames/AstralOCR-8b
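
Besides the interactive `ollama run` command, the model can presumably be called through Ollama's local REST API. The sketch below assumes a default Ollama install listening on `http://localhost:11434` and uses the `/api/generate` endpoint with a base64-encoded image. The helper names `build_ocr_request` and `run_ocr` and the default prompt are illustrative, not part of this model's documentation.

```python
import base64
import json
import urllib.request

MODEL = "DedeProGames/AstralOCR-8b"
# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"
DEFAULT_PROMPT = "Transcribe all text in this image."  # illustrative prompt


def build_ocr_request(image_bytes: bytes, prompt: str = DEFAULT_PROMPT) -> dict:
    """Build a non-streaming /api/generate payload carrying one image."""
    return {
        "model": MODEL,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def run_ocr(image_path: str, prompt: str = DEFAULT_PROMPT) -> str:
    """Send one image to the local Ollama server and return the model's text."""
    with open(image_path, "rb") as f:
        payload = build_ocr_request(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the model pulled and the server running, `run_ocr("receipt.png")` (a hypothetical file) would return the transcription as a string.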


Readme


AstralOCR is the latest and most capable OCR model in the Astral family. It is built on SigLip-400M and Qwen2-7B, totaling 8B parameters. It brings major quality gains and adds new capabilities for multi-image and video understanding.

Notable features include:

  • Leading Performance: AstralOCR reaches an average score of 65.2 on the latest OpenCompass (an evaluation spanning 8 popular benchmarks). With only 8B parameters, it can outperform widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet on single-image understanding.

  • Multi-Image Understanding & In-Context Learning: AstralOCR supports conversation and reasoning over multiple images. It reports state-of-the-art results on multi-image benchmarks like Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and shows promising in-context learning behavior.

  • Strong OCR Capability: AstralOCR can handle images with any aspect ratio and up to 1.8 million pixels (e.g., 1344×1344). It reports state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. With RLAIF-V and VisCPM techniques, it aims for more trustworthy behavior (notably lower hallucination rates than GPT-4o/GPT-4V on Object HalBench) and supports multiple languages, including English, Chinese, German, French, Italian, Korean, and more.

  • Superior Efficiency: AstralOCR emphasizes high token density (more pixels per visual token). It produces only ~640 tokens for a 1.8M-pixel image, around 75% fewer than many alternatives, which improves inference speed, first-token latency, memory usage, and power consumption.
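
The pixel and token figures quoted in the list can be checked with a little arithmetic. The sketch below computes the implied token density from the numbers above and shows one way to downscale an image to the 1.8M-pixel budget while keeping its aspect ratio. The helper `fit_within_budget` is hypothetical. The model or runtime may well resize inputs on its own, so treat this as an illustration of the numbers rather than a required preprocessing step.

```python
import math

MAX_PIXELS = 1344 * 1344      # 1,806,336 pixels, the "1.8 million" example above
TOKENS_FOR_MAX_IMAGE = 640    # approximate visual-token count quoted above


def token_density(pixels: int = MAX_PIXELS, tokens: int = TOKENS_FOR_MAX_IMAGE) -> float:
    """Pixels represented by each visual token."""
    return pixels / tokens


def fit_within_budget(width: int, height: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Scale (width, height) down, preserving aspect ratio, so the area fits the budget."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    # Flooring both sides keeps the product at or below max_pixels.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


# If ~640 tokens is "around 75% fewer", the implied baseline is 640 / 0.25 tokens.
implied_baseline_tokens = TOKENS_FOR_MAX_IMAGE / (1 - 0.75)

print(token_density())            # about 2822 pixels per token
print(implied_baseline_tokens)    # 2560.0
print(fit_within_budget(4000, 3000))
```

Under these figures each visual token covers roughly 2,800 pixels, and the "75% fewer" claim corresponds to a comparison point of about 2,560 tokens for the same image.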