ibm/granite3.3-vision:2b · vision, tools · 5cdfea23a292 · 3.6GB

granite · 2.53B · Q8_0
clip · 442M · F16
{ "num_ctx": 16384, "temperature": 0 }

Readme

Granite 3.3 Vision models

Granite-vision-3.3-2b is a compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more. Granite-vision-3.3-2b introduces several novel experimental features such as image segmentation, doctags generation, and multi-page support (see Experimental Capabilities for more details) and offers enhanced safety when compared to earlier Granite vision models.

The model was trained on meticulously curated instruction-following data comprising diverse public and synthetic datasets tailored to support a wide range of document understanding and general image tasks. Granite-vision-3.3-2b was trained by fine-tuning a Granite large language model with both image and text modalities.

Running

ollama run ibm/granite3.3-vision:2b
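
With the CLI, an image can be attached by including its file path directly in the prompt. Through the official ollama Python library, the same kind of request looks roughly like the sketch below (the model tag matches the command above; the image path and question are placeholder assumptions):

    import ollama

    # Ask the vision model a question about a local document image.
    # "./invoice.png" is a hypothetical file; replace it with a real path.
    response = ollama.chat(
        model="ibm/granite3.3-vision:2b",
        messages=[{
            "role": "user",
            "content": "Extract the line-item totals from this invoice.",
            "images": ["./invoice.png"],
        }],
    )
    print(response["message"]["content"])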

Model Architecture

The architecture of granite-vision-3.3-2b consists of the following components:

  1. Vision encoder: SigLIP2.

  2. Vision-language connector: two-layer MLP with GELU activation function.

  3. Large language model: granite-3.1-2b-instruct with 128k context length.

We built upon LLaVA to train our model. We use multi-layer encoder features and a denser grid resolution in AnyRes to enhance the model’s ability to understand nuanced visual content, which is essential for accurately interpreting document images.
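
As a rough illustration of the connector described above, a two-layer MLP projector with a GELU activation can be sketched in PyTorch as follows. This is a minimal sketch, not the model's actual implementation; the feature dimensions are placeholder assumptions:

    import torch
    import torch.nn as nn

    class VisionLanguageConnector(nn.Module):
        # Projects vision-encoder patch features into the LLM embedding space.
        # vision_dim and llm_dim are illustrative values, not Granite's real sizes.
        def __init__(self, vision_dim: int = 1152, llm_dim: int = 2048):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            # features: (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
            return self.proj(features)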

Experimental Capabilities

Granite-vision-3.3-2b introduces three new experimental capabilities:

  1. Image segmentation: a notebook showing a segmentation example is available.

  2. Doctags generation: parse document images into structured text in the DocTags format. Please see the Docling project for more details on DocTags.

  3. Multipage support: The model was trained to handle question answering (QA) over multiple consecutive pages of a document (up to 8 pages), given the demands of long-context processing. To keep such long sequences within GPU memory limits, we recommend resizing images so that their longer dimension is 768 pixels, as shown in the sketch after this list.
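
A minimal sketch of that resizing step, using Pillow (the 768-pixel cap follows the recommendation above; the file names are placeholders):

    from PIL import Image

    def resize_longer_side(path: str, max_side: int = 768) -> Image.Image:
        # Downscale so the longer dimension is at most max_side, keeping aspect ratio.
        img = Image.open(path)
        w, h = img.size
        scale = max_side / max(w, h)
        if scale < 1:  # only shrink, never enlarge
            img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
        return img

    # Hypothetical multipage input: up to 8 consecutive pages.
    pages = [resize_longer_side(f"page_{i}.png") for i in range(1, 9)]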


Learn more