Granite Docling

Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. It preserves the core features of Docling while maintaining seamless integration with DoclingDocuments to ensure full compatibility.

Model Summary

Granite Docling 258M builds on the Idefics3 architecture with two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and the language model with a Granite 165M LLM.

Granite Docling 258M is fully integrated into the Docling pipelines, which leverage all of its capabilities for a one-shot prediction of all Docling features.

Features

🏷️ DocTags for Efficient Tokenization – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments.
🔍 OCR (Optical Character Recognition) – Extracts text accurately from images.
📐 Layout and Localization – Preserves document structure and document element bounding boxes.
💻 Code Recognition – Detects and formats code blocks including indentation.
🔢 Formula Recognition – [Enhanced] Identifies and processes mathematical expressions.
🧮 Inline Equations – Improved inline math recognition.
📊 Chart Recognition – Extracts and interprets chart data.
📑 Table Recognition – Supports column and row headers for structured table extraction.
🖼️ Figure Classification – Differentiates figures and graphical elements.
📝 Caption Correspondence – Links captions to relevant images and figures.
📜 List Grouping – Organizes and structures list elements correctly.
📄 Full-Page Conversion – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
🧩 Flexible Inference Modes – Choose between full-page inference and bbox-guided region inference.
📂 General Document Processing – Trained for both scientific and non-scientific documents.
🧾 Document Element QA – Answers questions about a document’s structure, such as the presence and order of document elements.
🌍 Multi-language – Japanese, Arabic, and Chinese support (experimental).
💨 Fast inference using vLLM – Averages 0.35 seconds per page on an A100 GPU.
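
To put the quoted inference speed in perspective, the following minimal sketch turns the 0.35 seconds-per-page average into batch estimates. The function names are hypothetical, and the numbers assume the average holds uniformly across pages, which real workloads may not match.

```python
import math

# Average per-page latency quoted in the feature list (vLLM on an A100 GPU).
SECONDS_PER_PAGE = 0.35


def estimate_seconds(num_pages: int, seconds_per_page: float = SECONDS_PER_PAGE) -> float:
    """Return the estimated seconds needed to convert num_pages pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * seconds_per_page


def pages_per_hour(seconds_per_page: float = SECONDS_PER_PAGE) -> int:
    """Return how many whole pages fit into one hour at the given rate."""
    return math.floor(3600 / seconds_per_page)


if __name__ == "__main__":
    print(f"1000 pages: about {estimate_seconds(1000):.0f} seconds")
    print(f"Pages per hour: {pages_per_hour()}")
```

At this rate a 1,000-page batch takes roughly 350 seconds, or a little over 10,000 pages per hour on a single GPU, before any loading or preprocessing overhead.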

Intended Use

Granite-Docling is designed to complement the Docling library, not replace it. It integrates as a component within the larger Docling library, consolidating the functions of multiple single-purpose models into a single, compact VLM. However, Granite-Docling is not intended for general image understanding. For tasks focused solely on image-text input, we recommend using Granite Vision models, which are purpose-built and optimized for image-text processing.

Supported Instructions

Description	Instruction	Short Instruction
Full conversion	Convert this page to docling.	-
Chart	Convert chart to table.	`<chart>`
Formula	Convert formula to LaTeX.	`<formula>`
Code	Convert code to text.	`<code>`
Table	Convert table to OTSL. (Lysak et al., 2023)	`<otsl>`
Actions and Pipelines	OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>	-
	Identify element at: <loc_247><loc_482><loc_252><loc_486>	-
	Find all 'text' elements on the page, retrieve all section headers.	-
	Detect footer elements on the page.	-
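
The instructions in the table can be assembled programmatically. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the four `<loc_N>` tokens encode left, top, right, and bottom on a 0 to 500 grid normalized to the page size. Neither the coordinate order nor the grid size is stated in this document, so check both against the model's documentation before relying on them.

```python
import re

# Instruction strings copied from the table above.
INSTRUCTIONS = {
    "full": "Convert this page to docling.",
    "chart": "Convert chart to table.",
    "formula": "Convert formula to LaTeX.",
    "code": "Convert code to text.",
    "table": "Convert table to OTSL.",
}

LOC_GRID = 500  # ASSUMPTION: location tokens index a 0..500 grid
LOC_PATTERN = re.compile(r"<loc_(\d+)>")


def bbox_to_loc_tokens(bbox, page_width, page_height, grid=LOC_GRID):
    """Turn a pixel bbox (left, top, right, bottom) into four <loc_N> tokens.

    ASSUMPTION: the coordinate order is left, top, right, bottom.
    """
    left, top, right, bottom = bbox
    scaled = (
        round(left / page_width * grid),
        round(top / page_height * grid),
        round(right / page_width * grid),
        round(bottom / page_height * grid),
    )
    # Clamp each value to the grid so out-of-page boxes stay valid tokens.
    return "".join(f"<loc_{min(max(v, 0), grid)}>" for v in scaled)


def loc_tokens_to_bbox(text):
    """Read the first four <loc_N> tokens in text back as a tuple of ints."""
    values = [int(v) for v in LOC_PATTERN.findall(text)[:4]]
    if len(values) != 4:
        raise ValueError("expected four <loc_N> tokens")
    return tuple(values)


def ocr_region_prompt(bbox, page_width, page_height):
    """Build the region OCR instruction for a pixel-space bounding box."""
    tokens = bbox_to_loc_tokens(bbox, page_width, page_height)
    return f"OCR the text in a specific location: {tokens}"


if __name__ == "__main__":
    print(INSTRUCTIONS["full"])
    print(ocr_region_prompt((310, 466, 412, 474), 1000, 1000))
```

Keeping the grid size as a parameter makes it easy to adjust the helpers if the model's actual coordinate convention differs from the one assumed here.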

Learn more

Developers: IBM Research
Website: Docling
Model: ibm-granite/granite-docling-258M
GitHub Repository: docling-project/docling
Release Date: September 17, 2025
License: Apache 2.0