Granite Docling

Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. It preserves the core features of Docling while maintaining seamless integration with DoclingDocuments to ensure full compatibility.

Model Summary

Granite Docling 258M builds upon the Idefics3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM.
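The component names above carry a few numbers that can be worked out directly. The sketch below assumes that `patch16-512` follows the usual naming convention (16×16-pixel patches on a 512×512 input) and treats the parameter split as plain subtraction of the headline sizes, not as an official breakdown.

```python
# Back-of-the-envelope numbers derived only from the names quoted above.
# ASSUMPTION: "patch16-512" means 16x16-pixel patches on a 512x512 input.
PATCH_SIZE = 16
IMAGE_SIZE = 512

patches_per_side = IMAGE_SIZE // PATCH_SIZE   # patches along one edge
patches_per_image = patches_per_side ** 2     # patches the encoder sees per input

# Headline sizes from the model summary, in millions of parameters.
TOTAL_PARAMS_M = 258
LLM_PARAMS_M = 165

# Whatever is not the language model: vision encoder plus any connector layers.
non_llm_params_m = TOTAL_PARAMS_M - LLM_PARAMS_M

print(f"{patches_per_side} x {patches_per_side} = {patches_per_image} patches per input")
print(f"~{non_llm_params_m}M parameters outside the language model")
```

How many visual tokens actually reach the language model depends on the connector, which the summary does not describe, so the patch count is an upper bound on per-input visual tokens under this assumption.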

Granite Docling 258M is fully integrated into the Docling pipelines, which leverage all of its capabilities for a one-shot prediction of all Docling features.

Features

  • 🏷️ DocTags for Efficient Tokenization – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments.
  • 🔍 OCR (Optical Character Recognition) – Extracts text accurately from images.
  • 📐 Layout and Localization – Preserves document structure and document element bounding boxes.
  • 💻 Code Recognition – Detects and formats code blocks including indentation.
  • 🔢 Formula Recognition – [Enhanced] Identifies and processes mathematical expressions.
  • 🧮 Inline Equations – Improved recognition of inline math.
  • 📊 Chart Recognition – Extracts and interprets chart data.
  • 📑 Table Recognition – Supports column and row headers for structured table extraction.
  • 🖼️ Figure Classification – Differentiates figures and graphical elements.
  • 📝 Caption Correspondence – Links captions to relevant images and figures.
  • 📜 List Grouping – Organizes and structures list elements correctly.
  • 📄 Full-Page Conversion – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
  • 🧩 Flexible Inference Modes – Choose between full-page inference and bbox-guided region inference.
  • 📂 General Document Processing – Trained for both scientific and non-scientific documents.
  • 🧾 Document Element QA – Answers questions about a document’s structure, such as the presence and order of document elements.
  • 🌍 Multi-language – Japanese, Arabic and Chinese support (experimental).
  • 💨 Fast inference using vLLM – Averages 0.35 seconds per page on an A100 GPU.
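To put the quoted per-page latency in perspective, here is a small sizing sketch. It assumes the 0.35-second average holds for your documents and that pages are processed one after another on a single A100; real throughput will vary with page complexity and batching. The function names are illustrative only.

```python
# Rough capacity estimate from the average latency quoted in the feature list.
# ASSUMPTION: sequential processing on one GPU at the stated average.
SECONDS_PER_PAGE = 0.35


def estimated_seconds(num_pages, seconds_per_page=SECONDS_PER_PAGE):
    """Total wall-clock seconds to convert num_pages pages."""
    return num_pages * seconds_per_page


def pages_per_hour(seconds_per_page=SECONDS_PER_PAGE):
    """Pages converted in one hour at the given average latency."""
    return 3600 / seconds_per_page


print(f"1,000 pages: about {estimated_seconds(1000) / 60:.1f} minutes")
print(f"One hour: about {pages_per_hour():,.0f} pages")
```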

Intended Use

Granite-Docling is designed to complement the Docling library, not replace it. It integrates as a component within the larger Docling library, consolidating the functions of multiple single-purpose models into a single, compact VLM. However, Granite-Docling is not intended for general image understanding. For tasks focused solely on image-text input, we recommend using Granite Vision models, which are purpose-built and optimized for image-text processing.

Supported Instructions

| Description | Instruction | Short Instruction |
| --- | --- | --- |
| Full conversion | Convert this page to docling. | - |
| Chart | Convert chart to table. | <chart> |
| Formula | Convert formula to LaTeX. | <formula> |
| Code | Convert code to text. | <code> |
| Table | Convert table to OTSL. (Lysak et al., 2023) | <otsl> |
| Actions and Pipelines | OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237> | - |
| | Identify element at: <loc_247><loc_482><loc_252><loc_486> | - |
| | Find all 'text' elements on the page, retrieve all section headers. | - |
| | Detect footer elements on the page. | - |
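The instruction strings above can be assembled programmatically. The following is a minimal sketch, not part of the Docling API: the helper names are hypothetical, and it assumes that `<loc_N>` tokens are ordered left, top, right, bottom and normalized to a 0–500 grid. Check both assumptions against the DocTags documentation before relying on them.

```python
# Instruction strings copied from the table of supported instructions.
INSTRUCTIONS = {
    "full": "Convert this page to docling.",
    "chart": "Convert chart to table.",
    "formula": "Convert formula to LaTeX.",
    "code": "Convert code to text.",
    "table": "Convert table to OTSL.",
}

# ASSUMPTION: location tokens use a 0..500 grid, independent of pixel size.
GRID = 500


def loc_tokens(bbox, page_width, page_height, grid=GRID):
    """Turn a pixel bbox (left, top, right, bottom) into <loc_N> tokens.

    ASSUMPTION: token order is left, top, right, bottom.
    """
    left, top, right, bottom = bbox
    pairs = [
        (left, page_width),
        (top, page_height),
        (right, page_width),
        (bottom, page_height),
    ]
    tokens = []
    for value, extent in pairs:
        scaled = round(value * grid / extent)
        scaled = min(grid, max(0, scaled))  # clamp to the grid
        tokens.append(f"<loc_{scaled}>")
    return "".join(tokens)


def ocr_region_prompt(bbox, page_width, page_height):
    """Build the 'OCR the text in a specific location' instruction."""
    return "OCR the text in a specific location: " + loc_tokens(
        bbox, page_width, page_height
    )


def identify_element_prompt(bbox, page_width, page_height):
    """Build the 'Identify element at' instruction."""
    return "Identify element at: " + loc_tokens(bbox, page_width, page_height)


# Example: a region on a 1000 x 2000 pixel page image.
print(INSTRUCTIONS["table"])
print(ocr_region_prompt((310, 932, 412, 948), 1000, 2000))
```

Keeping the scaling in one helper means a different grid size or coordinate order only needs to be changed in a single place.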
