GigaChat3-10B-A1.8B is a dialogue model of the GigaChat family. The model is based on a Mixture-of-Experts (MoE) architecture with 10B total and 1.8B active parameters. The architecture includes Multi-head Latent Attention and Multi-Token Prediction.

Details

Updated 2 months ago

5415d36a3076 · 6.1GB

model: arch deepseek2 · parameters 10.7B · quantization Q4_K_S (6.1GB)

system (164B): Ты — GigaChat, умный помощник, разработанный Сбером. Отвеч… (in English: "You are GigaChat, a smart assistant developed by Sber. …"; truncated in the source)

params (175B): { "num_ctx": 8192, "stop": [ "<|start_header_id|>", "<|end_header_id|>", …

template (254B): {{ if .System }}<|start_header_id|>system<|end_header_id|> {{ .System }}<|eot_id|>{{ end }}{{ if .Pr…
🇬🇧EN

GigaChat3-10B-A1.8B

We present GigaChat3-10B-A1.8B — a conversational model from the GigaChat family. The model is based on a Mixture-of-Experts (MoE) architecture with 10B total parameters and 1.8B active parameters.
The architecture includes Multi-head Latent Attention (MLA) and Multi-Token Prediction (MTP), which optimize the model for high inference throughput.
The model is trained on top of our base version (GigaChat3-10B-A1.8B-base) using high-quality SFT data.
This version is intended for high-performance fp8 inference; the bf16 version is available as GigaChat3-10B-A1.8B.
More details are available in the Habr article.

Model Architecture

GigaChat3-10B-A1.8B uses a custom MoE architecture:

Multi-head Latent Attention (MLA)

Instead of standard Multi-head Attention, the model uses MLA. MLA enables efficient inference by compressing the Key-Value (KV) cache into a latent vector, which significantly reduces memory requirements and speeds up processing.
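To make the memory claim concrete, here is a rough per-token cache-size comparison. It is a sketch, not the model's real configuration: the layer count, head count, head size, latent size, and RoPE key size below are illustrative assumptions, and the MLA formula follows the DeepSeek-V2 style layout (one compressed latent plus a small decoupled RoPE key per layer), which is assumed here because the model metadata reports the `deepseek2` architecture.

```python
def mha_cache_bytes_per_token(num_layers, num_heads, head_dim, bytes_per_value=2):
    """Standard multi-head attention caches one key and one value vector
    per head, per layer, for every token."""
    return num_layers * 2 * num_heads * head_dim * bytes_per_value


def mla_cache_bytes_per_token(num_layers, latent_dim, rope_dim, bytes_per_value=2):
    """MLA (DeepSeek-V2 style, assumed) caches one compressed latent vector
    plus a small decoupled RoPE key per layer, for every token."""
    return num_layers * (latent_dim + rope_dim) * bytes_per_value


# Illustrative dimensions only; NOT taken from the GigaChat3 config.
LAYERS, HEADS, HEAD_DIM = 24, 32, 128
LATENT_DIM, ROPE_DIM = 512, 64

mha_bytes = mha_cache_bytes_per_token(LAYERS, HEADS, HEAD_DIM)
mla_bytes = mla_cache_bytes_per_token(LAYERS, LATENT_DIM, ROPE_DIM)

print(f"MHA cache per token: {mha_bytes} bytes")
print(f"MLA cache per token: {mla_bytes} bytes")
print(f"Reduction factor:    {mha_bytes / mla_bytes:.1f}x")
```

With these assumed sizes the cache shrinks by roughly an order of magnitude; the real factor depends on the actual model dimensions.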

Multi-Token Prediction (MTP)

The model is trained using the Multi-Token Prediction (MTP) objective. This allows the model to predict multiple tokens in a single forward pass, accelerating generation by up to 40% via speculative/parallel generation techniques.
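One common way to turn extra predicted tokens into a speedup is greedy draft-and-verify decoding: the extra heads propose several tokens, the main model checks them in a single pass, and the longest matching prefix is accepted. The sketch below shows only that acceptance rule. It is a generic illustration, not the confirmed GigaChat or vLLM implementation, and the function name and inputs are hypothetical.

```python
def accept_draft(draft_tokens, verified_tokens):
    """Return the tokens emitted by one greedy draft-and-verify step.

    draft_tokens:    tokens proposed ahead of time (for example by MTP heads).
    verified_tokens: the main model's greedy choice at each position, where
                     verified_tokens[i] is conditioned on draft_tokens[:i].
                     It must hold len(draft_tokens) + 1 entries.
    """
    if len(verified_tokens) != len(draft_tokens) + 1:
        raise ValueError("verified_tokens must have one more entry than draft_tokens")

    accepted = []
    for i, token in enumerate(draft_tokens):
        if token != verified_tokens[i]:
            break  # first disagreement: discard the rest of the draft
        accepted.append(token)

    # The main model's own token at the first mismatch (or the bonus token
    # after a fully accepted draft) is always emitted.
    accepted.append(verified_tokens[len(accepted)])
    return accepted


# Two of three drafted tokens match, so three tokens come out of one step.
print(accept_draft([5, 7, 9], [5, 7, 2, 4]))
```

Every step emits at least one token, so the output matches plain greedy decoding; the gain comes from steps where several drafted tokens are accepted at once.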

Training Data

The model was trained on 20T tokens.
We added 10 languages — from Chinese and Arabic to Uzbek and Kazakh — and expanded the set of data sources: books, academic data, code and math datasets. All data undergo deduplication, language filtering, and automated quality checks using heuristics and classifiers.
A key contribution to quality came from synthetic data: we generated around 5.5 trillion tokens of it. The corpus includes question–answer pairs for texts, reverse-prompt chains for data structuring, LLM notes with model comments embedded in texts, and millions of synthetic problems with solutions in mathematics and competitive programming (with synthetic tests) based on PromptCoT.
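The deduplication and heuristic quality checks mentioned above can be pictured with a toy filter. This is a minimal sketch under stated assumptions: exact-match deduplication on normalized text plus two simple heuristics. The thresholds and function names are hypothetical, and the actual pipeline (including its classifiers and language filtering) is not described in enough detail here to reproduce.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def dedup_and_filter(docs, min_words=5, max_symbol_ratio=0.3):
    """Drop exact duplicates (after normalization), very short documents,
    and documents dominated by non-alphanumeric symbols."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)

        if len(norm.split()) < min_words:
            continue  # too short to be useful

        symbols = sum(1 for ch in norm if not (ch.isalnum() or ch.isspace()))
        if symbols / max(len(norm), 1) > max_symbol_ratio:
            continue  # mostly punctuation or markup debris

        kept.append(doc)
    return kept


sample = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "too short",
    "@@@@ #### $$$$ %%%% ^^^^ &&&& ****",
]
print(dedup_and_filter(sample))
```

Production pipelines typically add near-duplicate detection and learned quality classifiers on top of rules like these.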

Inference

One of the key advantages of GigaChat3-10B-A1.8B is inference speed. The model (especially in MTP mode) demonstrates throughput comparable to that of much smaller dense models.
We measured performance using vLLM v0.11.0, with bfloat16 and batch_size=1.
Link to the code.

| Model | request_throughput (req/s) | output_throughput (tok/s) | total_token_throughput (tok/s) | mean_ttft_ms |
|---|---|---|---|---|
| `Qwen3-1.7B` | 1.689 | 357.308 | 726.093 | 11.824 |
| `mtp-GigaChat3-10B-A1.8B-base` | 1.533 | 333.620 | 678.894 | 26.345 |
| `GigaChat3-10B-A1.8B-base` | 1.077 | 234.363 | 476.912 | 31.053 |
| `Qwen3-4B` | 0.978 | 206.849 | 420.341 | 14.947 |
| `Qwen3-8B` | 0.664 | 140.432 | 285.375 | 16.663 |
| `YandexGPT-5-Lite-8B-pretrain` | 0.641 | 147.305 | 300.269 | 16.711 |
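The table also lets us check the MTP gain directly by comparing the two GigaChat rows. The snippet below is plain arithmetic on the reported output throughput; the variable and function names are illustrative.

```python
# Output throughput values copied from the table above.
OUTPUT_THROUGHPUT = {
    "mtp-GigaChat3-10B-A1.8B-base": 333.620,
    "GigaChat3-10B-A1.8B-base": 234.363,
}


def relative_gain(with_mtp, without_mtp):
    """Fractional throughput improvement of the MTP run over the plain run."""
    return with_mtp / without_mtp - 1.0


gain = relative_gain(
    OUTPUT_THROUGHPUT["mtp-GigaChat3-10B-A1.8B-base"],
    OUTPUT_THROUGHPUT["GigaChat3-10B-A1.8B-base"],
)
print(f"MTP output-throughput gain: {gain:.1%}")
```

In this measurement the MTP run delivers roughly 42% more output throughput than the same model without MTP, in line with the "up to 40%" figure quoted for MTP.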

Benchmarks

Although the model has 10 billion total parameters, its closest counterparts are models in the 3–4 billion parameter range. Because of its high generation speed, we also compare it with even more compact models.

| Metric | GigaChat 3 Lightning | Qwen3-1.7B-Instruct | Qwen3-4B-Instruct-2507 | SmolLM3 |
|---|---|---|---|---|
| MMLU_RU_FIVE_SHOT | 0.6833 | 0.4876 | 0.5972 | 0.4998 |
| RUBQ_ZERO_SHOT | 0.6516 | 0.2557 | 0.3170 | 0.6363 |
| MMLU_PRO_EN_FIVE_SHOT | 0.6061 | 0.410 | 0.6849 | 0.5013 |
| MMLU_EN_FIVE_SHOT | 0.7403 | 0.60 | 0.7080 | 0.5992 |
| BBH_THREE_SHOT | 0.4525 | 0.3317 | 0.7165 | 0.4161 |
| SuperGPQA | 0.2731 | 0.2092 | 0.3745 | 0.2459 |
| MATH_500_FOUR_SHOT | 0.7000 | 0.7520 | 0.8880 | 0.8020 |
| GPQA_COT_ZERO_SHOT | 0.3502 | 0.2651 | 0.5370 | 0.3704 |
| LiveCodeBench_ZERO_SHOT | 0.2031 | 0.0794 | 0.3046 | 0.1656 |
| HUMAN_EVAL_PLUS_ZERO_SHOT | 0.6951 | 0.6280 | 0.8780 | 0.7012 |
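For a quick summary of the table, the sketch below counts on how many of the ten benchmarks GigaChat 3 Lightning scores higher than each baseline. The scores are copied from the table; the helper name is illustrative, and a simple win count ignores the size of each gap.

```python
MODELS = ["GigaChat 3 Lightning", "Qwen3-1.7B-Instruct", "Qwen3-4B-Instruct-2507", "SmolLM3"]

# Rows copied from the benchmark table above, in the same column order as MODELS.
SCORES = {
    "MMLU_RU_FIVE_SHOT": [0.6833, 0.4876, 0.5972, 0.4998],
    "RUBQ_ZERO_SHOT": [0.6516, 0.2557, 0.3170, 0.6363],
    "MMLU_PRO_EN_FIVE_SHOT": [0.6061, 0.410, 0.6849, 0.5013],
    "MMLU_EN_FIVE_SHOT": [0.7403, 0.60, 0.7080, 0.5992],
    "BBH_THREE_SHOT": [0.4525, 0.3317, 0.7165, 0.4161],
    "SuperGPQA": [0.2731, 0.2092, 0.3745, 0.2459],
    "MATH_500_FOUR_SHOT": [0.7000, 0.7520, 0.8880, 0.8020],
    "GPQA_COT_ZERO_SHOT": [0.3502, 0.2651, 0.5370, 0.3704],
    "LiveCodeBench_ZERO_SHOT": [0.2031, 0.0794, 0.3046, 0.1656],
    "HUMAN_EVAL_PLUS_ZERO_SHOT": [0.6951, 0.6280, 0.8780, 0.7012],
}


def wins_against(baseline, target="GigaChat 3 Lightning"):
    """Number of benchmarks where `target` scores strictly higher than `baseline`."""
    t, b = MODELS.index(target), MODELS.index(baseline)
    return sum(1 for row in SCORES.values() if row[t] > row[b])


for baseline in MODELS[1:]:
    print(f"vs {baseline}: {wins_against(baseline)} of {len(SCORES)}")
```

On these numbers GigaChat 3 Lightning is ahead on 9 of 10 benchmarks against Qwen3-1.7B-Instruct, 7 of 10 against SmolLM3, and 3 of 10 against Qwen3-4B-Instruct-2507, where its leads are on MMLU_RU, RUBQ, and MMLU_EN.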

