Gemma 4 models are designed to deliver frontier-level performance at each size. They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.
3.2M Pulls 29 Tags Updated yesterday
Qwen 3.5 is a family of open-source multimodal models that delivers exceptional utility and performance.
6.2M Pulls 58 Tags Updated 1 week ago
A new collection of open translation models built on Gemma 3, helping people communicate across 55 languages.
1.1M Pulls 13 Tags Updated 2 months ago
The Ministral 3 family is designed for edge deployment, capable of running on a wide range of hardware.
936.2K Pulls 16 Tags Updated 4 months ago
A 24B model that excels at using tools to explore codebases, edit multiple files, and power software engineering agents.
774K Pulls 6 Tags Updated 4 months ago
GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture.
255.1K Pulls 3 Tags Updated 2 months ago
Kimi K2.5 is an open-source, natively multimodal agentic model that integrates vision and language understanding with advanced agentic capabilities, supporting both instant and thinking modes across conversational and agentic paradigms.
242.8K Pulls 1 Tag Updated 2 months ago
DeepSeek-OCR is a vision-language model that can perform token-efficient OCR.
400.5K Pulls 3 Tags Updated 4 months ago
Gemini 3 Flash offers frontier intelligence built for speed at a fraction of the cost.
131.9K Pulls 2 Tags Updated 3 months ago
A general-purpose multimodal mixture-of-experts model for production-grade tasks and enterprise workloads.
42.8K Pulls 1 Tag Updated 4 months ago
The most powerful vision-language model in the Qwen model family to date.
3.2M Pulls 59 Tags Updated 5 months ago
An update to Mistral Small that improves function calling and instruction following, and reduces repetition errors.
1.7M Pulls 5 Tags Updated 9 months ago
Qwen's flagship vision-language model, and a significant leap over the previous Qwen2-VL.
1.8M Pulls 17 Tags Updated 10 months ago
Currently the most capable model that runs on a single GPU.
35.5M Pulls 29 Tags Updated 4 months ago
🌋 LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding. Updated to version 1.6.
13.8M Pulls 98 Tags Updated 2 years ago
Llama 3.2 Vision is a collection of instruction-tuned image reasoning generative models in 11B and 90B sizes.
4.4M Pulls 9 Tags Updated 10 months ago
A series of multimodal LLMs (MLLMs) designed for vision-language understanding.
5M Pulls 17 Tags Updated 1 year ago
Meta's latest collection of multimodal models.
1.6M Pulls 11 Tags Updated 10 months ago
A LLaVA model fine-tuned from Llama 3 Instruct, with better scores on several benchmarks.
2.2M Pulls 4 Tags Updated 1 year ago
A compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more.
886.5K Pulls 5 Tags Updated 1 year ago