Astria is a multimodal model built by combining a LLaVA vision encoder with the new Ministral model, producing a unified system capable of detailed visual understanding and strong general-purpose reasoning.

7ac30fe735fd · 2.9GB · llama 3.82B Q4_K_M · clip 303M F16

Readme


Astria

Astria is a next-generation, fully local multimodal foundation model built on top of a Ministral-based language backbone and a custom vision encoder. This architecture significantly improves visual grounding, multilingual reasoning, and agentic reliability while remaining efficient enough for edge deployment.


🚀 Astria Update Highlights

Me7war’s latest Astria update pushes the limits of small-scale multimodal AI, combining efficiency, reasoning, and vision capabilities:

Key Features

  • Vision Mastery: Custom encoder enables deep image understanding and precise visual–text alignment.
  • Multilingual Support: Handles dozens of languages—English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, Chinese, Japanese, Korean—while maintaining strong reasoning and generation.
  • Agent-Ready: Native function calls, reliable JSON outputs, and strict prompt adherence make Astria fully agentic-capable.
  • Edge Efficiency: Optimized for minimal hardware without sacrificing performance.
  • Large Context Window: Up to 256k tokens for long-form reasoning, document-level comprehension, and complex multi-step tasks.
  • Enhanced Reasoning: Ministral backbone ensures stronger factual grounding, smoother multimodal alignment, and improved long-horizon reasoning.
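The agent-ready features above map onto Ollama's standard tool-calling API. The sketch below only constructs the request payload; the `get_weather` tool and its schema are illustrative placeholders, not part of Astria. POSTing the body to a local server at `http://localhost:11434/api/chat` is left to the reader.

```python
import json

# Build an Ollama /api/chat request that offers Astria a function to call.
# The tool name and parameter schema here are hypothetical examples.
payload = {
    "model": "Me7war/Astria",
    "messages": [
        {"role": "user", "content": "What is the weather in Paris right now?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "stream": False,
}

# Serialized JSON body, ready to POST to http://localhost:11434/api/chat
body = json.dumps(payload)
```

A model that supports tool calling responds with a `tool_calls` entry in the assistant message, which an agent loop can execute and feed back as a `tool` role message.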

A fully local, compact model redefining what edge-deployable multimodal AI can achieve.


📊 Visual Reasoning Performance

Astria applies a custom evaluation using GPT-5 PRO as the judge.

92.53% — New SOTA

LLaVA baseline: 90.92%

A custom evaluation on 30 unseen images, with three instruction types per image (conversation, description, complex reasoning), shows Astria outperforming GPT-5 in every category.


📦 Usage (Ollama)

Pull the model:

```shell
ollama pull Me7war/Astria
```

Run locally:

```shell
ollama run Me7war/Astria
```
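For programmatic multimodal use, Ollama's REST API accepts images as base64-encoded strings in the request body. A minimal sketch, assuming the model was pulled as `Me7war/Astria`; the placeholder bytes stand in for a real image file, and the payload would be POSTed to `http://localhost:11434/api/generate`:

```python
import base64
import json

# Stand-in image bytes; in practice: open("photo.png", "rb").read()
fake_image = b"\x89PNG\r\n\x1a\n"

# Ollama /api/generate request with an attached image
payload = {
    "model": "Me7war/Astria",
    "prompt": "Describe this image in detail.",
    "images": [base64.b64encode(fake_image).decode("ascii")],
    "stream": False,
}

# Serialized JSON body for the POST request
body = json.dumps(payload)
```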

Evaluation: Astria vs GPT-5


A custom evaluation set of 30 unseen images was constructed. Each image includes three instruction types:

  1. Conversational understanding
  2. Detailed visual description
  3. Complex multimodal reasoning

This yields 90 unique image–language tasks, evaluated on:

  • Astria
  • GPT-5

Scoring was performed by GPT-5 PRO, using a 1–10 scale per task.
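The headline percentages are presumably an aggregate of these per-task judge scores. The exact formula is not stated, so the mean-of-scores-over-ten conversion below is an assumption:

```python
def aggregate(scores):
    """Average 1-10 judge scores across tasks, expressed as a percentage.

    This conversion (mean score / 10, as a percent) is an assumed
    aggregation, not the published evaluation formula.
    """
    return 100 * sum(scores) / (10 * len(scores))

# e.g. a run where every one of the 90 tasks scores 9/10 reports 90.0
print(aggregate([9] * 90))  # → 90.0
```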

Results

Astria outperforms GPT-5 across all instruction categories, validating the effectiveness of the custom vision encoder combined with the Ministral knowledge-enhanced language model.


Model Summary

  • Vision Encoder: Custom-built, with precise visual-text alignment
  • Language Backbone: Ministral-based, optimized for reasoning and factual accuracy
  • Training: End-to-end multimodal alignment with knowledge supervision
  • Output: Grounded, structured, and context-aware responses
  • Deployment: Fully local and edge-optimized, supporting up to 256k token context

License

Astria is released under the Astria License for personal and non-commercial use. Commercial use requires explicit permission from the creator.