520 Downloads Updated 3 weeks ago
c807e8d2c7e5 · 522MB
Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. It preserves the core features of Docling while maintaining seamless integration with DoclingDocuments to ensure full compatibility.
Granite Docling 258M builds upon the Idefics3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM.
Granite-docling-258M is fully integrated into the Docling pipelines, which leverage all of its capabilities to predict every Docling feature in a single pass.
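As a rough illustration of what a single-pass request could look like outside the Docling pipelines, the sketch below builds a request body for a locally served copy of the model. It assumes an Ollama-style `/api/generate` endpoint that accepts base64-encoded images. The model tag `granite-docling:258m` and the helper name `build_generate_payload` are hypothetical, so substitute whatever tag your local installation reports.

```python
import base64
import json

# Assumption: the tag under which the model is served locally (hypothetical).
MODEL_TAG = "granite-docling:258m"

# The full-conversion instruction documented for the model.
FULL_CONVERSION = "Convert this page to docling."


def build_generate_payload(image_bytes: bytes,
                           instruction: str = FULL_CONVERSION) -> dict:
    """Build a JSON-serializable body for a POST to an /api/generate endpoint.

    The page image is passed as raw bytes and base64-encoded here, which is
    the encoding Ollama-style APIs expect in the "images" list.
    """
    return {
        "model": MODEL_TAG,
        "prompt": instruction,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one complete response instead of chunks
    }


def payload_as_json(image_bytes: bytes) -> str:
    """Serialize the payload so it can be sent with any HTTP client."""
    return json.dumps(build_generate_payload(image_bytes))
```

The payload is built separately from the network call so it can be inspected or logged before anything is sent. Within Docling itself, the library's own pipeline handles this step.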
Granite-Docling is designed to complement the Docling library, not replace it. It integrates as a component within the larger Docling library, consolidating the functions of multiple single-purpose models into a single, compact VLM. However, Granite-Docling is not intended for general image understanding. For tasks focused solely on image-text input, we recommend using Granite Vision models, which are purpose-built and optimized for image-text processing.
| Description | Instruction | Short Instruction |
|---|---|---|
| Full conversion | Convert this page to docling. | - |
| Chart | Convert chart to table. | <chart> |
| Formula | Convert formula to LaTeX. | <formula> |
| Code | Convert code to text. | <code> |
| Table | Convert table to OTSL. (Lysak et al., 2023) | <otsl> |
| Actions and Pipelines | OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237> | - |
| | Identify element at: <loc_247><loc_482><loc_252><loc_486> | - |
| | Find all 'text' elements on the page, retrieve all section headers. | - |
| | Detect footer elements on the page. | - |
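The instructions in the table above can be assembled programmatically. The following is a minimal sketch, not an official API: the helper names (`instruction_for`, `loc_tokens`, `ocr_prompt`, `identify_prompt`) are hypothetical. It also assumes that the `<loc_N>` tokens index a normalized 0 to 500 grid in the order left, top, right, bottom. That assumption is consistent with the example values shown but is not stated in this document.

```python
from typing import Optional, Sequence, Tuple

# Instruction strings copied from the table; the second entry is the short
# form, or None where the table lists "-".
INSTRUCTIONS = {
    "full": ("Convert this page to docling.", None),
    "chart": ("Convert chart to table.", "<chart>"),
    "formula": ("Convert formula to LaTeX.", "<formula>"),
    "code": ("Convert code to text.", "<code>"),
    "table": ("Convert table to OTSL.", "<otsl>"),
}

# Assumption: location tokens live on a 0..GRID normalized grid.
GRID = 500


def instruction_for(task: str, short: bool = False) -> str:
    """Return the long instruction, or the short tag if requested and available."""
    long_form, short_form = INSTRUCTIONS[task]
    return short_form if (short and short_form is not None) else long_form


def loc_tokens(bbox: Sequence[float], page_width: float, page_height: float,
               grid: int = GRID) -> str:
    """Convert a pixel bounding box (x0, y0, x1, y1) into four <loc_N> tokens."""
    x0, y0, x1, y1 = bbox
    scaled = (
        x0 * grid / page_width,
        y0 * grid / page_height,
        x1 * grid / page_width,
        y1 * grid / page_height,
    )
    # Round to the nearest grid cell and clamp to the valid range.
    clamped = [min(grid, max(0, round(v))) for v in scaled]
    return "".join(f"<loc_{v}>" for v in clamped)


def ocr_prompt(bbox: Sequence[float], page_width: float, page_height: float) -> str:
    """Build the 'OCR the text in a specific location' instruction."""
    return "OCR the text in a specific location: " + loc_tokens(bbox, page_width, page_height)


def identify_prompt(bbox: Sequence[float], page_width: float, page_height: float) -> str:
    """Build the 'Identify element at' instruction."""
    return "Identify element at: " + loc_tokens(bbox, page_width, page_height)
```

Under the stated grid assumption, a box at pixels `(310, 466, 412, 474)` on a 1000 by 1000 page maps to `<loc_155><loc_233><loc_206><loc_237>`, the token sequence used in the OCR example in the table.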