Ollama
Vision models · Ollama
Vision models on Ollama.
  • gemma4

    Gemma 4 models are designed to deliver frontier-level performance at each size. They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

    vision tools thinking audio cloud e2b e4b 26b 31b

    3.2M Pulls · 29 Tags · Updated yesterday

  • qwen3.5

    Qwen 3.5 is a family of open-source multimodal models that delivers exceptional utility and performance.

    vision tools thinking cloud 0.8b 2b 4b 9b 27b 35b 122b

    6.2M Pulls · 58 Tags · Updated 1 week ago

  • translategemma

    A new collection of open translation models built on Gemma 3, helping people communicate across 55 languages.

    vision 4b 12b 27b

    1.1M Pulls · 13 Tags · Updated 2 months ago

  • ministral-3

    The Ministral 3 family is designed for edge deployment, capable of running on a wide range of hardware.

    vision tools cloud 3b 8b 14b

    936.2K Pulls · 16 Tags · Updated 4 months ago

  • devstral-small-2

    A 24B model that excels at using tools to explore codebases, editing multiple files, and powering software engineering agents.

    vision tools cloud 24b

    774K Pulls · 6 Tags · Updated 4 months ago

  • glm-ocr

    GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture.

    vision tools

    255.1K Pulls · 3 Tags · Updated 2 months ago

  • kimi-k2.5

    Kimi K2.5 is an open-source, native multimodal agentic model that integrates vision and language understanding with advanced agentic capabilities. It supports instant and thinking modes, as well as conversational and agentic paradigms.

    vision tools thinking cloud

    242.8K Pulls · 1 Tag · Updated 2 months ago

  • deepseek-ocr

    DeepSeek-OCR is a vision-language model that can perform token-efficient OCR.

    vision 3b

    400.5K Pulls · 3 Tags · Updated 4 months ago

  • gemini-3-flash-preview

    Gemini 3 Flash offers frontier intelligence built for speed at a fraction of the cost.

    vision tools thinking cloud

    131.9K Pulls · 2 Tags · Updated 3 months ago

  • mistral-large-3

    A general-purpose multimodal mixture-of-experts model for production-grade tasks and enterprise workloads.

    vision tools cloud

    42.8K Pulls · 1 Tag · Updated 4 months ago

  • qwen3-vl

    The most powerful vision-language model in the Qwen model family to date.

    vision tools thinking cloud 2b 4b 8b 30b 32b 235b

    3.2M Pulls · 59 Tags · Updated 5 months ago

  • mistral-small3.2

    An update to Mistral Small that improves function calling and instruction following and produces fewer repetition errors.

    vision tools 24b

    1.7M Pulls · 5 Tags · Updated 9 months ago

  • qwen2.5vl

    The flagship vision-language model of the Qwen family and a significant leap from the previous Qwen2-VL.

    vision 3b 7b 32b 72b

    1.8M Pulls · 17 Tags · Updated 10 months ago

  • gemma3

    The current, most capable model that runs on a single GPU.

    vision cloud 270m 1b 4b 12b 27b

    35.5M Pulls · 29 Tags · Updated 4 months ago

  • llava

    🌋 LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding. Updated to version 1.6.

    vision 7b 13b 34b

    13.8M Pulls · 98 Tags · Updated 2 years ago

  • llama3.2-vision

    Llama 3.2 Vision is a collection of instruction-tuned image reasoning generative models in 11B and 90B sizes.

    vision 11b 90b

    4.4M Pulls · 9 Tags · Updated 10 months ago

  • minicpm-v

    A series of multimodal LLMs (MLLMs) designed for vision-language understanding.

    vision 8b

    5M Pulls · 17 Tags · Updated 1 year ago

  • llama4

    Meta's latest collection of multimodal models.

    vision tools 16x17b 128x17b

    1.6M Pulls · 11 Tags · Updated 10 months ago

  • llava-llama3

    A LLaVA model fine-tuned from Llama 3 Instruct, with better scores on several benchmarks.

    vision 8b

    2.2M Pulls · 4 Tags · Updated 1 year ago

  • granite3.2-vision

    A compact and efficient vision-language model designed for visual document understanding. It enables automated content extraction from tables, charts, infographics, plots, diagrams, and more.

    vision tools 2b

    886.5K Pulls · 5 Tags · Updated 1 year ago

© 2026 Ollama