jyan1 / paligemma-mix-224

## Model information

### Description
PaliGemma is a versatile and lightweight vision-language model (VLM) inspired by [PaLI-3](https://arxiv.org/abs/2310.09199) and based on open components such as the [SigLIP vision model](https://arxiv.org/abs/2303.15343) and the [Gemma language model](https://arxiv.org/abs/2403.08295). It takes both an image and text as input and generates text as output, supporting multiple languages. It is designed for class-leading fine-tuning performance on a wide range of vision-language tasks such as image and short-video captioning, visual question answering, text reading, object detection, and object segmentation.

### Model architecture
PaliGemma combines a [Transformer decoder](https://arxiv.org/abs/1706.03762) with a [Vision Transformer image encoder](https://arxiv.org/abs/2010.11929), for a total of 3 billion parameters. The text decoder is initialized from [Gemma-2B](https://www.kaggle.com/models/google/gemma). The image encoder is initialized from [SigLIP-So400m/14](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb). PaliGemma is trained following the PaLI-3 recipes.
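
As a rough illustration of how the encoder's patch size relates to the input resolution, the sketch below counts the image patches a Vision Transformer would produce. It assumes that the `224` in the model tag is the square input resolution in pixels and that the `/14` in SigLIP-So400m/14 is the patch size; neither is stated explicitly on this page, and `num_image_patches` is a hypothetical helper, not part of any library.

```python
def num_image_patches(resolution: int, patch_size: int) -> int:
    """Count non-overlapping square patches in a square image.

    Assumption: the image is resolution x resolution pixels and is cut
    into patch_size x patch_size tiles, as in a standard ViT.
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    per_side = resolution // patch_size
    return per_side * per_side


# Assumed values: 224-pixel input, 14-pixel patches.
patches_224 = num_image_patches(224, 14)
```

Under these assumptions a 224-pixel input yields 16 patches per side, so the language decoder would see a few hundred image positions ahead of the text prompt.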

### Inputs and outputs
- Input: An image and a text string, such as a prompt to caption the image or a question about it.
- Output: Generated text in response to the input, such as a caption of the image, an answer to a question, a list of object bounding box coordinates, or segmentation codewords.
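
Bounding boxes arrive as text, so they need to be parsed before use. The following is a minimal sketch, assuming the location-token format described in the upstream PaliGemma documentation: each box is four `<locNNNN>` tokens in the order y_min, x_min, y_max, x_max, with values normalized to 1024 bins, followed by a label, and multiple boxes separated by `;`. This page does not confirm that format for this build, and `parse_boxes` is a hypothetical helper.

```python
import re

# Assumed format: <locYMIN><locXMIN><locYMAX><locXMAX> label ; ...
_BOX = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]+)"
)


def parse_boxes(text: str, width: int, height: int) -> list[dict]:
    """Convert location tokens into pixel-space (x0, y0, x1, y1) boxes."""
    boxes = []
    for match in _BOX.finditer(text):
        # Token values are assumed to be normalized to 1024 bins.
        y0, x0, y1, x1 = (int(v) / 1024 for v in match.groups()[:4])
        boxes.append({
            "label": match.group(5).strip(),
            "box": (x0 * width, y0 * height, x1 * width, y1 * height),
        })
    return boxes


# Hand-written sample string for illustration, not real model output.
sample = "<loc0256><loc0128><loc0768><loc0512> dog ; <loc0000><loc0000><loc0512><loc0512> pool"
parsed = parse_boxes(sample, width=224, height=224)
```

Pass the original image's width and height so that the boxes come out in that image's pixel coordinates rather than at the model's input resolution.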

### Usage
Ensure you are on this branch: [Josh and Roy's Paligemma Support :)](https://github.com/ollama/ollama/pull/6393)

Run the model:
```
ollama run jyan1/paligemma-mix-224
```
Then, at the `>>>` prompt, include the path to your image in your message:
```
>>> What is in this image? /path/to/paligemma/puppy.jpg
Added image '/path/to/paligemma/puppy.jpg'
A brown dog wearing a floral shirt and lei stands proudly next to a clear blue 
pool. The dog's mouth is open, its paw rests on the edge of the water, and its 
eyes are focused on the horizon. The pool water is crystal clear, and the palm 
trees in the distance provide shade for the dog. A black leash connects the dog 
to its owner, and a flower lei is around the dog's neck. The dog's fur is brown, 
and its nose is black. The tree behind the pool is tall and slender, and the 
fence surrounding the pool is made of metal posts.
```
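
The same request can also be sent programmatically. The sketch below uses Ollama's `/api/generate` REST endpoint, which accepts base64-encoded images in an `images` list. It assumes a local Ollama server on the default port 11434 and that the PaliGemma branch serves this model through that endpoint in the same way as other multimodal models; `build_payload` and `describe_image` are hypothetical helper names.

```python
import base64
import json
import urllib.request

# Assumption: a local Ollama server listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, image_bytes, model="jyan1/paligemma-mix-224"):
    """Build the JSON body for a single non-streaming generate request."""
    return {
        "model": model,
        "prompt": prompt,
        # The API expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(image_path, prompt="What is in this image?"):
    """Send one image and prompt to the server and return the reply text."""
    with open(image_path, "rb") as f:
        payload = build_payload(prompt, f.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running, `describe_image("/path/to/paligemma/puppy.jpg")` should return a caption like the one shown in the interactive session.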

### References
[HuggingFace](https://huggingface.co/google/paligemma-3b-mix-448)
