
Mistral Small 3.1


Mistral Small 3.1: the best model in its weight class.

Building on Mistral Small 3, this new model comes with improved text performance, multimodal understanding, and an expanded context window of up to 128k tokens. The model outperforms comparable models like Gemma 3 and GPT-4o Mini, while delivering inference speeds of 150 tokens per second.
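As a rough illustration of what the quoted decode speed means in practice, generation time scales linearly with output length. This sketch assumes the 150 tokens/s figure above and ignores prompt prefill and network overhead, which add to real wall-clock latency:

```python
# Back-of-envelope decode-time estimate at a fixed generation rate.
# Assumes the 150 tokens/s figure quoted above; prompt processing
# (prefill) and network overhead are deliberately ignored.
DECODE_RATE_TPS = 150  # tokens per second

def decode_seconds(output_tokens: int, rate_tps: float = DECODE_RATE_TPS) -> float:
    """Estimated seconds to stream `output_tokens` at `rate_tps`."""
    return output_tokens / rate_tps

# A 1,500-token answer would take about 10 seconds to decode.
print(f"{decode_seconds(1500):.1f}s")
```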

Mistral Small 3.1 is released under an Apache 2.0 license.

Key Features

  • Vision: Vision capabilities enable the model to analyze images and provide insights based on visual content in addition to text.
  • Multilingual: Supports dozens of languages, including English, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Nepali, Polish, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese, Arabic, Bengali, Chinese, and Farsi.
  • Agent-Centric: Offers best-in-class agentic capabilities with native function calling and JSON output.
  • Advanced Reasoning: State-of-the-art conversational and reasoning capabilities.
  • Apache 2.0 License: Open license allowing usage and modification for both commercial and non-commercial purposes.
  • Context Window: A 128k-token context window.
  • System Prompt: Maintains strong adherence and support for system prompts.
  • Tokenizer: Uses the Tekken tokenizer with a 131k-token vocabulary.
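The function-calling feature above is typically exercised through an OpenAI-style chat-completions request. Below is a minimal sketch of such a request payload; the model identifier (`mistral-small3.1`), the weather tool, and its schema are assumptions for illustration only, not part of this page:

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" schema.
# The weather lookup is invented here purely for illustration.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Request body you would POST to a chat-completions endpoint.
payload = {
    "model": "mistral-small3.1",  # assumed model identifier
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [get_weather],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}

print(json.dumps(payload, indent=2))
```

If the model decides to call the tool, the response carries a `tool_calls` entry with JSON arguments matching the declared parameter schema.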

Benchmark Results

Where available, we report numbers previously published by other model providers; otherwise, we re-evaluate them using our own evaluation harness.

Pretrain Evals

| Model | MMLU (5-shot) | MMLU Pro (5-shot CoT) | TriviaQA | GPQA Main (5-shot CoT) | MMMU |
| --- | --- | --- | --- | --- | --- |
| Small 3.1 24B Base | 81.01% | 56.03% | 80.50% | 37.50% | 59.27% |
| Gemma 3 27B PT | 78.60% | 52.20% | 81.30% | 24.30% | 56.10% |

Instruction Evals

Text

| Model | MMLU | MMLU Pro (5-shot CoT) | MATH | GPQA Main (5-shot CoT) | GPQA Diamond (5-shot CoT) | MBPP | HumanEval | SimpleQA (TotalAcc) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Small 3.1 24B Instruct | 80.62% | 66.76% | 69.30% | 44.42% | 45.96% | 74.71% | 88.41% | 10.43% |
| Gemma 3 27B IT | 76.90% | 67.50% | 89.00% | 36.83% | 42.40% | 74.40% | 87.80% | 10.00% |
| GPT-4o Mini | 82.00% | 61.70% | 70.20% | 40.20% | 39.39% | 84.82% | 87.20% | 9.50% |
| Claude 3.5 Haiku | 77.60% | 65.00% | 69.20% | 37.05% | 41.60% | 85.60% | 88.10% | 8.02% |
| Cohere Aya-Vision 32B | 72.14% | 47.16% | 41.98% | 34.38% | 33.84% | 70.43% | 62.20% | 7.65% |

Vision

| Model | MMMU | MMMU Pro | MathVista | ChartQA | DocVQA | AI2D | MM-MT-Bench |
| --- | --- | --- | --- | --- | --- | --- |
| Small 3.1 24B Instruct | 64.00% | 49.25% | 68.91% | 86.24% | 94.08% | 93.72% | 7.3 |
| Gemma 3 27B IT | 64.90% | 48.38% | 67.60% | 76.00% | 86.60% | 84.50% | 7.0 |
| GPT-4o Mini | 59.40% | 37.60% | 56.70% | 76.80% | 86.70% | 88.10% | 6.6 |
| Claude 3.5 Haiku | 60.50% | 45.03% | 61.60% | 87.20% | 90.00% | 92.10% | 6.5 |
| Cohere Aya-Vision 32B | 48.20% | 31.50% | 50.10% | 63.04% | 72.40% | 82.57% | 4.1 |

Multilingual Evals

| Model | Average | European | East Asian | Middle Eastern |
| --- | --- | --- | --- | --- |
| Small 3.1 24B Instruct | 71.18% | 75.30% | 69.17% | 69.08% |
| Gemma 3 27B IT | 70.19% | 74.14% | 65.65% | 70.76% |
| GPT-4o Mini | 70.36% | 74.21% | 65.96% | 70.90% |
| Claude 3.5 Haiku | 70.16% | 73.45% | 67.05% | 70.00% |
| Cohere Aya-Vision 32B | 62.15% | 64.70% | 57.61% | 64.12% |

Long Context Evals

| Model | LongBench v2 | RULER 32K | RULER 128K |
| --- | --- | --- | --- |
| Small 3.1 24B Instruct | 37.18% | 93.96% | 81.20% |
| Gemma 3 27B IT | 34.59% | 91.10% | 66.00% |
| GPT-4o Mini | 29.30% | 90.20% | 65.80% |
| Claude 3.5 Haiku | 35.19% | 92.60% | 91.90% |

Source: Mistral AI

Figure: GPQA Diamond performance comparison (perf-gpqa-diamond-mistral.png)