ollama run gabegoodhart/granite4.1-speech:2b

Granite 4.1 Speech

Granite-Speech-4.1 is a compact and efficient speech-language model from IBM, purpose-built for multilingual automatic speech recognition (ASR) and bidirectional automatic speech translation (AST). It was created by modality-aligning an intermediate checkpoint of granite-4.0-1b-base to speech, and trained on 174,000 hours of audio from public corpora plus synthetic data.

Parameter Sizes

2B:

ollama run gabegoodhart/granite4.1-speech:2b /path/to/audio.wav "transcribe the speech with proper punctuation and capitalization."
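
For scripted use, the same invocation can be driven from Python. This is a minimal sketch that assumes only what the command above shows (model tag, then audio path, then prompt) and that the `ollama` CLI is on `PATH` with the model already pulled. The helper names `build_ollama_command` and `transcribe` are hypothetical, not part of Ollama.

```python
import subprocess

# Model tag copied from the command above.
MODEL = "gabegoodhart/granite4.1-speech:2b"


def build_ollama_command(audio_path: str, prompt: str, model: str = MODEL) -> list[str]:
    """Return the argv list mirroring: ollama run <model> <audio> "<prompt>"."""
    return ["ollama", "run", model, audio_path, prompt]


def transcribe(audio_path: str, prompt: str) -> str:
    """Run the CLI once and return its stdout.

    Assumption: the `ollama` binary is installed and prints the model's
    answer to stdout. Raises CalledProcessError on a non-zero exit code.
    """
    result = subprocess.run(
        build_ollama_command(audio_path, prompt),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Example call (needs a local Ollama install, so it is left commented out):
# print(transcribe("/path/to/audio.wav",
#                  "transcribe the speech with proper punctuation and capitalization."))
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or punctuation.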

Supported Languages

ASR is supported for English, French, German, Spanish, Portuguese, and Japanese.

Speech translation (AST) is supported to and from English for the languages above, plus English-to-Italian and English-to-Mandarin.

Intended Use

Granite-Speech-4.1 is designed for enterprise applications that process speech inputs — converting speech to text and translating between English and the supported languages. The model accepts mono, 16 kHz audio along with a text prompt that specifies the task.
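
Because the model expects mono, 16 kHz input, it can help to check a file before sending it. The sketch below uses Python's standard `wave` module, so it only handles uncompressed PCM WAV files. The function name `check_wav_format` is hypothetical, and resampling itself is out of scope here.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # sample rate the model accepts, per the text above
EXPECTED_CHANNELS = 1      # mono


def check_wav_format(path: str) -> list[str]:
    """Return a list of format problems; an empty list means mono 16 kHz."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected mono, found {channels} channels")
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"expected {EXPECTED_RATE_HZ} Hz, found {rate} Hz")
    return problems
```

A file that fails the check would need to be downmixed or resampled with an audio tool of your choice before transcription.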

To trigger speech processing, include the <|audio|> tag in your prompt. If the model receives an unfamiliar or malformed prompt, it falls back to transcription by default.

Capabilities

  • Multilingual ASR — High-accuracy transcription across six languages, powered by a dual-head CTC conformer encoder (graphemic + BPE outputs) with frame importance sampling.
  • Speech Translation (AST) — Bidirectional translation between English and supported languages, including English-to-Italian and English-to-Mandarin.
  • Punctuation & Truecasing — Produces properly punctuated and capitalized output, including German noun capitalization, via a prompt change.
  • Keyword Biasing — Improved recognition of names, acronyms, and technical jargon when supplied with a keyword list.

Preferred Prompts by Task

Task                   Prompt
ASR (raw)              can you transcribe the speech into a written format?
ASR (punctuation)      transcribe the speech with proper punctuation and capitalization.
ASR (keyword biasing)  transcribe the speech to text. Keywords: <kw1>, <kw2>, ...
AST (raw)              translate the speech to <language>.
AST (punctuation)      translate the speech to <language> with proper punctuation and capitalization.

Note: Non-English ASR still requires an English prompt.
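
The preferred prompts can be generated programmatically so they stay verbatim. This is a minimal sketch: `build_prompt` and its parameters are hypothetical names, the prompt strings are copied from the list above, and prepending the `<|audio|>` tag (rather than placing it elsewhere in the prompt) is an assumption. Combining keyword biasing with the punctuation prompt is not covered above, so the helper rejects that combination instead of guessing.

```python
from typing import Optional, Sequence

AUDIO_TAG = "<|audio|>"  # tag named in the Intended Use section


def build_prompt(
    task: str,
    *,
    punctuate: bool = False,
    language: Optional[str] = None,
    keywords: Optional[Sequence[str]] = None,
    with_audio_tag: bool = False,
) -> str:
    """Return the preferred English prompt for task "asr" or "ast"."""
    if task == "asr":
        if keywords and punctuate:
            # No combined prompt is documented, so do not invent one.
            raise ValueError("keywords and punctuate cannot be combined")
        if keywords:
            prompt = "transcribe the speech to text. Keywords: " + ", ".join(keywords)
        elif punctuate:
            prompt = "transcribe the speech with proper punctuation and capitalization."
        else:
            prompt = "can you transcribe the speech into a written format?"
    elif task == "ast":
        if not language:
            raise ValueError("AST needs a target language")
        prompt = f"translate the speech to {language}"
        prompt += " with proper punctuation and capitalization." if punctuate else "."
    else:
        raise ValueError(f"unknown task: {task!r}")
    # Assumption: the tag goes at the start of the prompt.
    return AUDIO_TAG + prompt if with_audio_tag else prompt
```

For example, `build_prompt("ast", language="German", punctuate=True)` yields the AST (punctuation) prompt with `<language>` filled in. The prompt stays in English even when the audio is not.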

Evaluation

On the Open ASR Leaderboard, Granite-Speech-4.1-2b achieves a mean WER of 5.33% at an RTFx of 231.29 (inverse real-time factor: seconds of audio processed per second of compute, so higher is faster).

Dataset            WER (%)
LibriSpeech Clean  1.33
LibriSpeech Other  2.5
SPGISpeech         3.78
AMI                8.09
Earnings22         8.37
Gigaspeech         9.8
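
As a reading aid for these figures, the sketch below shows how the two metrics are defined: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and RTFx is audio duration divided by processing time. The function names are hypothetical, and the sketch skips the text normalization that a leaderboard applies before scoring, so it will not reproduce the reported numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (previous row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Time needed to process the audio at a given RTFx (audio time / compute time)."""
    return audio_seconds / rtfx
```

At the reported RTFx of 231.29, `processing_seconds(3600, 231.29)` comes to roughly 15.6 seconds for one hour of audio on the benchmark hardware.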

Learn more

  • Developers: IBM Granite Speech Team
  • Release Date: April 29, 2026
  • License: Apache 2.0