DedeProGames/AstralOCR-8b

231 downloads · Updated 3 months ago

A GPT-4o Level MLLM for Single Image, Multi Image and High-FPS Video Understanding

vision

ollama run DedeProGames/AstralOCR-8b

curl http://localhost:11434/api/chat \
  -d '{
    "model": "DedeProGames/AstralOCR-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

from ollama import chat

response = chat(
    model='DedeProGames/AstralOCR-8b',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'DedeProGames/AstralOCR-8b',
  messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)
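The page tags this model as a vision model, but the snippets above send text only. The sketch below shows one way to attach images to a chat request, using only the Python standard library and the same `/api/chat` endpoint as the curl example. It assumes the server accepts base64-encoded images in a per-message `images` list, which is how the Ollama REST API documents image input. `build_chat_payload`, `send_chat`, and the `page.png` path are illustrative names, not part of this model's documentation.

```python
import base64
import json
import urllib.request

MODEL = "DedeProGames/AstralOCR-8b"
# Default local Ollama endpoint, as used in the curl example above.
CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(images, prompt, model=MODEL):
    """Build a non-streaming /api/chat body from raw image bytes and a prompt."""
    # Assumption: images travel as base64 strings in the message's "images" list.
    encoded = [base64.b64encode(data).decode("ascii") for data in images]
    return {
        "model": model,
        "stream": False,  # request a single JSON object instead of a stream
        "messages": [{"role": "user", "content": prompt, "images": encoded}],
    }


def send_chat(payload, url=CHAT_URL):
    """POST the payload to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]


# Example use (needs a running server and a real image file):
#   with open("page.png", "rb") as handle:
#       payload = build_chat_payload([handle.read()], "Transcribe the text in this image.")
#   print(send_chat(payload))
```

Passing several byte strings in `images` yields a multi-image request. How the model treats more than one image per message is not documented on this page, so test that case before relying on it.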

Models

1 model

| Name | Size | Context | Input | Updated |
| --- | --- | --- | --- | --- |
| AstralOCR-8b:latest | 5.9GB | 40K | Text, Image | 3 months ago |

Readme

![ChatGPT Image 1 de mar. de 2026, 18_23_13.png](/assets/DedeProGames/AstralOCR-8b/f22b6cb8-6d43-42d9-a016-833d4c39ed7b)

AstralOCR is the latest and most capable OCR model in the Astral family. It is built on SigLip-400M and Qwen2-7B, for a total of 8B parameters. It brings major quality gains and adds new capabilities for multi-image and video understanding.

Notable features include:

- **Leading Performance**: AstralOCR reaches an average score of 65.2 on the latest OpenCompass (an evaluation spanning 8 popular benchmarks). With only 8B parameters, it can outperform widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet on single-image understanding.

- **Multi-Image Understanding & In-Context Learning**: AstralOCR supports conversation and reasoning over multiple images. It reports state-of-the-art results on multi-image benchmarks such as Mantis-Eval, BLINK, MathVerse mv, and SciVerse mv, and shows promising in-context learning behavior.

- **Strong OCR Capability**: AstralOCR can handle images with any aspect ratio and up to 1.8 million pixels (e.g., 1344×1344). It reports state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. With RLAIF-V and VisCPM techniques, it aims for more trustworthy behavior (notably lower hallucination rates than GPT-4o/GPT-4V on Object HalBench) and supports multiple languages, including English, Chinese, German, French, Italian, Korean, and more.

- **Superior Efficiency**: AstralOCR emphasizes high token density (more pixels per visual token). It produces only ~640 tokens for a 1.8M-pixel image, around 75% fewer than many alternatives, which improves inference speed, first-token latency, memory usage, and power consumption.
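The pixel and token figures in the list above can be checked with a little arithmetic. The sketch below takes the quoted numbers at face value (a 1344×1344 budget, about 640 visual tokens at that size, and a 75% reduction) and derives the token density and the token count that the 75% figure implies for an alternative. The resize helper is a client-side pre-check added for illustration. This page does not say whether the model or runtime rescales larger images itself, and all function names here are hypothetical.

```python
import math

# Figures quoted in the feature list above.
MAX_PIXELS = 1344 * 1344   # about 1.8 million pixels
TOKENS_AT_MAX = 640        # approximate visual tokens at that size


def fit_within_pixel_budget(width, height, max_pixels=MAX_PIXELS):
    """Return (width, height) scaled down, aspect ratio kept, to fit the budget."""
    if width * height <= max_pixels:
        return width, height
    # Scale both sides by the same factor so the area shrinks to the budget.
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


def pixels_per_token(pixels=MAX_PIXELS, tokens=TOKENS_AT_MAX):
    """Token density: how many pixels each visual token covers."""
    return pixels / tokens


def implied_baseline_tokens(tokens=TOKENS_AT_MAX, reduction=0.75):
    """Token count of an alternative if `tokens` is `reduction` fewer than it."""
    return tokens / (1 - reduction)


print(fit_within_pixel_budget(2688, 2688))  # (1344, 1344)
print(pixels_per_token())                   # 2822.4
print(implied_baseline_tokens())            # 2560.0
```

Under these assumptions each visual token covers roughly 2,800 pixels, and "75% fewer" corresponds to an alternative that spends about 2,560 tokens on the same image.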
