957 Downloads Updated 3 days ago
ollama run maternion/lfm2.5:8b-Q4_K_XL
If errors occur try setting
OLLAMA_NEW_ENGINE=1as an environment variable.
LFM2.5 is a new family of hybrid models designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training and reinforcement learning.
Find more information about LFM2.5-8B-A1B in our blog post.

*AA-Omniscience Index (higher is better) rewards correct answers and penalizes hallucinations. Scores range from -100 to 100. See more results on Artificial Analysis.
| Model | Parameters | Description |
|---|---|---|
| LFM2.5-8B-A1B-Base | 8.3B total / 1.5B active | Pre-trained base model for fine-tuning |
| LFM2.5-8B-A1B | 8.3B total / 1.5B active | Reasoning-tuned general-purpose model |
LFM2.5-8B-A1B is a general-purpose text-only model with the following features:
temperature: 0.2top_p: 80repetition_penalty: 1.05| Model | Description |
|---|---|
| LFM2.5-8B-A1B | Original model checkpoint in native format. Best for fine-tuning or inference with Transformers, vLLM, and SGLang. |
| LFM2.5-8B-A1B-GGUF | Quantized format for llama.cpp and compatible tools. Optimized for edge inference and local deployment. |
| LFM2.5-8B-A1B-ONNX | ONNX Runtime format for cross-platform deployment. |
| LFM2.5-8B-A1B-MLX | MLX format for Apple Silicon. Optimized for fast inference on Mac devices. |
We recommend using LFM2.5-8B-A1B for agentic workflows, tool use, structured outputs, multilingual assistants, and on-device personal-assistant applications. It is not the best fit for heavy programming or knowledge-intensive question answering without retrieval.
LFM2.5 uses a ChatML-like format. See the Chat Template documentation for details. Example:
<|startoftext|><|im_start|>system
You are a helpful assistant trained by Liquid AI.<|im_end|>
<|im_start|>user
What is C. elegans?<|im_end|>
<|im_start|>assistant
Ollama Template for LFM2.5 is:
FROM ./lfm2.5-Q4_K_M.gguf
TEMPLATE {{ .Prompt }}
RENDERER lfm2
PARSER lfm2
PARAMETER temperature 0.2
PARAMETER top_p 80
PARAMETER repeat_penalty 1.05
Because LFM2.5-8B-A1B is a reasoning model, assistant turns contain an explicit chain of thought before the final answer. You can use tokenizer.apply_chat_template() to format your messages automatically.
LFM2.5 supports function calling in four steps:
tokenizer.apply_chat_template() with tools=....<|tool_call_start|> and <|tool_call_end|> special tokens), as the assistant answer. You can override this behavior by asking the model to output JSON function calls in the system prompt.tool role.See the Tool Use documentation for the full guide. Example:
<|startoftext|><|im_start|>system
List of tools: [{"name": "get_candidate_status", "description": "Retrieves the current status of a candidate in the recruitment process", "parameters": {"type": "object", "properties": {"candidate_id": {"type": "string", "description": "Unique identifier for the candidate"}}, "required": ["candidate_id"]}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>
<|im_start|>tool
[{"candidate_id": "12345", "status": "Interview Scheduled", "position": "Clinical Research Associate", "date": "2023-11-20"}]<|im_end|>
<|im_start|>assistant
The candidate with ID 12345 is currently in the "Interview Scheduled" stage for the position of Clinical Research Associate, with an interview date set for 2023-11-20.<|im_end|>
LFM2.5-8B-A1B is supported by many inference frameworks. See the Inference documentation for the full list.
| Name | Description | Docs | Notebook |
|---|---|---|---|
| Transformers | Simple inference with direct access to model internals. | Link | ![]() |
| vLLM | High-throughput production deployments with GPU. | Link | ![]() |
| llama.cpp | Cross-platform inference with CPU offloading. | Link | ![]() |
| MLX | Apple’s machine learning framework optimized for Apple Silicon. | Link | — |
| LM Studio | Desktop application for running LLMs locally. | Link | — |
Quick start with Transformers (compatible with transformers>=5.0.0):
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
model_id = "LiquidAI/LFM2.5-8B-A1B"
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="auto",
dtype="bfloat16",
# attn_implementation="flash_attention_2" <- uncomment on compatible GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
prompt = "What is C. elegans?"
input_ids = tokenizer.apply_chat_template(
[{"role": "user", "content": prompt}],
add_generation_prompt=True,
return_tensors="pt",
tokenize=True,
).to(model.device)
output = model.generate(
input_ids,
do_sample=True,
temperature=0.2,
top_k=80,
repetition_penalty=1.05,
max_new_tokens=8192,
streamer=streamer,
)
We recommend fine-tuning LFM2.5 for your specific use case to achieve the best results.
| Name | Description | Docs | Notebook |
|---|---|---|---|
| CPT (Unsloth) | Continued Pre-Training using Unsloth for text completion. | Link | ![]() |
| CPT (Unsloth) | Continued Pre-Training using Unsloth for translation. | Link | ![]() |
| SFT (Unsloth) | Supervised Fine-Tuning with LoRA using Unsloth. | Link | ![]() |
| SFT (TRL) | Supervised Fine-Tuning with LoRA using TRL. | Link | ![]() |
| DPO (TRL) | Direct Preference Optimization with LoRA using TRL. | Link | ![]() |
| GRPO (Unsloth) | GRPO with LoRA using Unsloth. | Link | ![]() |
| GRPO (TRL) | GRPO with LoRA using TRL. | Link | ![]() |
Thanks to reasoning, scaled-up pre-training, and large-scale RL, LFM2.5-8B-A1B improves over its predecessor across the board:
| Benchmark | LFM2-8B-A1B | LFM2.5-8B-A1B | Δ |
|---|---|---|---|
| AA-Omniscience Index | -78.42 | -24.70 | +53.62 |
| AA-Omniscience Accuracy | 7.33 | 8.67 | +1.34 |
| AA-Omniscience Non-Hallucination Rate | 7.46 | 63.47 | +56.01 |
| IFEval | 79.44 | 91.84 | +12.40 |
| IFBench | 26.00 | 56.47 | +30.47 |
| Multi-IF | 58.54 | 79.93 | +21.39 |
| MATH500 | 74.80 | 88.76 | +13.96 |
| AIME25 | 20.00 | 42.53 | +22.53 |
| BFCLv3 | 45.07 | 64.36 | +19.29 |
| BFCLv4 | 25.52 | 48.50 | +22.98 |
| Tau² Telecom | 13.60 | 88.07 | +74.47 |
| Tau² Retail | 7.02 | 39.82 | +32.80 |
| Model | Parameters | AA-Omni. Index | AA-Omni. Accuracy | AA-Omni. Non-Halluc. | IFEval | IFBench | Multi-IF |
|---|---|---|---|---|---|---|---|
| LFM2.5-8B-A1B | 8B/A1B | -24.70 | 8.67 | 63.47 | 91.84 | 56.47 | 79.93 |
| Granite-4.0-H-Tiny | 7B/A1B | -75.50 | 9.37 | 6.38 | 82.23 | 21.28 | 59.00 |
| Qwen3.5-4B | 4B | -51.53 | 17.20 | 16.99 | 87.80 | 50.38 | 67.43 |
| Qwen3-30B-A3B-Thinking-2507 | 30.5B/3.3B | -51.31 | 18.80 | 13.87 | 90.82 | 51.11 | 79.04 |
| Gemma-4-E2B-IT | 5.1B | -72 | 7.00 | 15.05 | 82.93 | 33.53 | 69.70 |
| Gemma-4-E4B-IT | 8B | -50.67 | 8.10 | 36.06 | 87.74 | 39.48 | 77.58 |
| Gemma-4-26B-A4B-IT | 26B/4B | -62.07 | 14.37 | 10.75 | 91.40 | 47.25 | 82.06 |
| gpt-oss-20b | 21B/3.6B | -49.17 | 14.57 | 24.50 | 86.73 | 58.65 | 76.64 |
| Model | Parameters | MATH500 | AIME25 | AIME26 | BFCLv3 | BFCLv4 | Tau² Telecom | Tau² Retail |
|---|---|---|---|---|---|---|---|---|
| LFM2.5-8B-A1B | 8B/A1B | 88.76 | 42.53 | 50.00 | 64.79 | 49.73 | 88.07 | 39.82 |
| Granite-4.0-H-Tiny | 7B/A1B | 59.20 | 4.93 | 3.33 | 56.89 | 28.52 | 16.67 | 18.42 |
| Qwen3.5-4B | 4B | 80.76 | 54.28 | 58.33 | 71.06 | 54.01 | 87.72 | 71.93 |
| Qwen3-30B-A3B-Thinking-2507 | 30.5B/3.3B | 86.48 | 71.67 | 66.67 | 73.39 | 50.53 | 21.93 | 56.14 |
| Gemma-4-E2B-IT | 5.1B | 64.00 | 26 | 30 | 56.44 | 31.91 | 22.37 | 18.95 |
| Gemma-4-E4B-IT | 8B | 65.00 | 34.33 | 40.67 | 57.31 | 33.92 | 26.75 | 42.11 |

LFM2.5-8B-A1B is the fastest model in its size class, reaching 18.5K output tokens per second at high concurrency, over 1.6B tokens per day on a single H100.

@article{liquidAI20268BA1B,
author = {Liquid AI},
title = {LFM2.5-8B-A1B: Personal Assistant On Your Laptop},
journal = {Liquid AI Blog},
year = {2026},
note = {www.liquid.ai/blog/lfm2-5-8b-a1b},
}
@article{liquidai2025lfm2,
title = {LFM2 Technical Report},
author = {Liquid AI},
journal = {arXiv preprint arXiv:2511.23404},
year = {2025}
}