Granite-vision-3.3-2b is a compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more. Granite-vision-3.3-2b introduces several novel experimental features such as image segmentation, doctags generation, and multi-page support (see Experimental Capabilities for more details) and offers enhanced safety when compared to earlier Granite vision models.
The model was trained on meticulously curated instruction-following data comprising diverse public and synthetic datasets tailored to support a wide range of document understanding and general image tasks. Granite-vision-3.3-2b was trained by fine-tuning a Granite large language model with both image and text modalities.
ollama run ibm/granite3.3-vision:2b
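Beyond the CLI, the model can also be queried programmatically. The following is a minimal sketch using the Ollama Python client; the model tag mirrors the command above, and the image path and question are placeholders to adapt to your own document.

```python
# pip install ollama
import ollama

# Ask the vision model a question about a local document image.
# "chart.png" is a placeholder path; the model tag should match the
# one pulled with `ollama run ibm/granite3.3-vision:2b`.
response = ollama.chat(
    model="ibm/granite3.3-vision:2b",
    messages=[
        {
            "role": "user",
            "content": "What is the highest value shown in this chart?",
            "images": ["chart.png"],
        }
    ],
)

print(response["message"]["content"])
```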
The architecture of granite-vision-3.3-2b consists of the following components:
Vision encoder: SigLIP2
Vision-language connector: a two-layer MLP with a GELU activation function.
Large language model: granite-3.1-2b-instruct with 128k context length.
We built upon LLaVA to train our model, using multi-layer encoder features and a denser grid resolution in AnyRes to enhance the model's ability to understand nuanced visual content, which is essential for accurately interpreting document images.
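To make the connector component concrete, here is a minimal PyTorch sketch of a two-layer MLP projector with GELU activation that maps vision-encoder patch features into the language model's embedding space. The dimensions (1152 for the vision features, 2048 for the LLM embeddings) are illustrative assumptions, not the released model's actual configuration.

```python
import torch
import torch.nn as nn


class VisionLanguageConnector(nn.Module):
    """Two-layer MLP with GELU, projecting vision-encoder features into
    the language model's embedding space. Dimensions are illustrative
    assumptions, not the released model's configuration."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_dim)
        return self.proj(vision_features)


# Example: project a batch of dummy patch features.
connector = VisionLanguageConnector()
dummy = torch.randn(1, 729, 1152)
print(connector(dummy).shape)  # torch.Size([1, 729, 2048])
```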
Granite-vision-3.3-2b introduces three new experimental capabilities:
Image segmentation: segment objects and regions within an image (a notebook showing a segmentation example is available).
Doctags generation: parse document images into structured text in the DocTags format. See the Docling project for more details on DocTags.
Multipage support: the model was trained to handle question answering (QA) over multiple consecutive pages of a document (up to 8 pages, given the demands of long-context processing). To support such long sequences without exceeding GPU memory limits, we recommend resizing images so that their longer dimension is 768 pixels, as in the sketch below.
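The resizing recommendation above is straightforward to apply before building a multipage request. The sketch below uses Pillow; the page file names are placeholders, and the 768-pixel target comes directly from the recommendation.

```python
from PIL import Image


def resize_longer_side(image: Image.Image, target: int = 768) -> Image.Image:
    """Resize so the longer dimension equals `target`, preserving aspect ratio."""
    w, h = image.size
    scale = target / max(w, h)
    return image.resize((round(w * scale), round(h * scale)), Image.Resampling.LANCZOS)


# Placeholder file names for an 8-page QA request.
pages = [resize_longer_side(Image.open(f"page_{i}.png")) for i in range(1, 9)]
```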