Kimi K2.5 is an open-source, natively multimodal agentic model that integrates vision and language understanding with advanced agentic capabilities, supporting both instant and thinking modes as well as conversational and agentic paradigms.
32.5K Pulls 1 Tag Updated 1 week ago
The most powerful vision-language model in the Qwen model family to date.
1.3M Pulls 59 Tags Updated 3 months ago
DeepSeek-OCR is a vision-language model that can perform token-efficient OCR.
150.1K Pulls 3 Tags Updated 2 months ago
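For context, here is a minimal sketch of how an OCR-capable model like the one above might be driven from the official `ollama` Python client (`pip install ollama`). The model tag `deepseek-ocr`, the prompt, and the image path are illustrative assumptions; check the model's page for the exact tag and any recommended prompt.

```python
# Minimal sketch: OCR via the ollama Python client.
# Assumes the Ollama server is running locally and the model has been pulled;
# "deepseek-ocr" and "./scanned_page.png" are placeholder assumptions.
import ollama

response = ollama.chat(
    model="deepseek-ocr",
    messages=[{
        "role": "user",
        "content": "Extract all text from this document image.",
        # Local image paths are read and encoded by the client.
        "images": ["./scanned_page.png"],
    }],
)
print(response["message"]["content"])
```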
Qwen's flagship vision-language model, and a significant leap from the previous Qwen2-VL.
1.2M Pulls 17 Tags Updated 8 months ago
A compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more.
710.9K Pulls 5 Tags Updated 11 months ago
Building upon Mistral Small 3, Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance.
571.5K Pulls 5 Tags Updated 10 months ago
🌋 LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding. Updated to version 1.6.
12.7M Pulls 98 Tags Updated 2 years ago
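As a usage note, a minimal sketch of querying a multimodal model such as LLaVA through Ollama's local REST API. It assumes the default server address (http://localhost:11434), that the model has already been pulled, and a hypothetical local image file `photo.jpg`.

```python
# Minimal sketch: image Q&A against Ollama's /api/generate endpoint.
# The endpoint accepts base64-encoded images in the "images" field.
import base64
import requests

with open("photo.jpg", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",
        "prompt": "Describe this image in one sentence.",
        "images": [image_b64],
        "stream": False,  # return a single JSON response instead of a stream
    },
)
print(resp.json()["response"])
```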
A series of multimodal LLMs (MLLMs) designed for vision-language understanding.
4.5M Pulls 17 Tags Updated 1 year ago
Llama 3.2 Vision is a collection of instruction-tuned image reasoning generative models in 11B and 90B sizes.
3.7M Pulls 9 Tags Updated 8 months ago
moondream2 is a small vision language model designed to run efficiently on edge devices.
590.3K Pulls 18 Tags Updated 1 year ago
A family of open-source models trained on a wide variety of data, surpassing ChatGPT on various benchmarks. Updated to version 3.5-0106.
411K Pulls 50 Tags Updated 2 years ago
A Gemini 2.5 Flash-level MLLM for vision, speech, and full-duplex multimodal live streaming on your phone.
283 Pulls 12 Tags Updated yesterday
A highly specialized vision model with more than 2B parameters.
146.9K Pulls 1 Tag Updated 2 months ago
Ultra-compact 256M vision-language model for video/image understanding. Supports visual QA, captioning, OCR, video analysis. Only 1.38GB VRAM. Built on SigLIP + SmolLM2. Available in Q8 and FP16. Apache 2.0 license.
112 Pulls 2 Tags Updated 1 week ago
A high-quality vision instruct model.
78 Pulls 1 Tag Updated 1 week ago
The most powerful vision-language model in the Qwen3 model family to date.
45.5K Pulls 54 Tags Updated 2 months ago
State-of-the-art OCR (Optical Character Recognition) vision language model based on [allenai/olmOCR-2-7B-1025](https://huggingface.co/allenai/olmOCR-2-7B-1025).
2,301 Pulls 1 Tag Updated 3 months ago
GLM 4.6V Flash 9B model with vision, tools, and hybrid thinking enabled. Uses a custom template to align it with Ollama and applies the recommended sampling settings by default. Built from Unsloth quants at Q4_K_M.
760 Pulls 1 Tag Updated 1 month ago