
LFM2 is a family of hybrid models designed for on-device deployment. LFM2-24B-A2B is the largest model in the family, scaling the architecture to 24 billion parameters while keeping inference efficient.

tools · 24b
ollama run lfm2
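Beyond the interactive CLI, a pulled model can be queried programmatically through Ollama's local HTTP API. A minimal sketch, assuming the server's default address (`http://localhost:11434`) and that `ollama run lfm2` has already pulled the model:

```python
import json
import urllib.request

# Assumes Ollama's default local endpoint; the model name "lfm2"
# matches the `ollama run lfm2` command above.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "lfm2") -> dict:
    """JSON payload for a single, non-streaming completion."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
    }

def generate(prompt: str) -> str:
    """Send the request; requires a running `ollama serve` instance."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example (needs the server running):
#   print(generate("Summarize mixture-of-experts routing in one sentence."))
```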

Applications

Claude Code: ollama launch claude --model lfm2
Codex: ollama launch codex --model lfm2
OpenCode: ollama launch opencode --model lfm2
OpenClaw: ollama launch openclaw --model lfm2


Readme



  • Best-in-class efficiency: A 24B MoE model with only 2B active parameters per token, fitting in 32 GB of RAM for deployment on consumer laptops and desktops.
  • Fast edge inference: 112 tok/s decode on an AMD CPU, 293 tok/s on an H100 GPU.
  • Predictable scaling: Quality improves log-linearly from 350M to 24B total parameters, confirming the LFM2 hybrid architecture scales reliably across nearly two orders of magnitude.
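The efficiency claims above follow from mixture-of-experts arithmetic: all 24B weights must be resident in memory, but only ~2B participate in each token, so decode cost tracks the active count. A back-of-the-envelope sketch (the 8-bit quantization figure is an illustrative assumption, not an official number):

```python
# Rough memory estimate for a 24B-total / 2B-active MoE model.
# Only the weight tensors are counted; KV cache and activations add overhead.

def weight_memory_gb(total_params_b: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

total_b = 24.0   # total parameters (billions)
active_b = 2.0   # active parameters per token (billions)

# At 8-bit quantization the full 24B weights take ~24 GB,
# which is why the model fits in 32 GB of RAM.
print(weight_memory_gb(total_b, 8))   # -> 24.0

# Each decoded token touches only the ~2B active parameters' weights,
# roughly 1/12 the memory traffic of a dense 24B model per token.
print(weight_memory_gb(active_b, 8))  # -> 2.0
```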
