New model scheduling
September 23, 2025
Ollama now includes a significantly improved model scheduling system. Before running a model, Ollama’s new engine measures the exact amount of memory required, rather than relying on the estimates used in previous versions of Ollama. This has several benefits:
- Significantly reduced crashes due to out-of-memory issues: Because memory management is exact, over-allocations no longer occur, meaning fewer out-of-memory errors.
- Maximized GPU utilization: Ollama’s new memory management allocates more memory to the GPU, increasing token generation and prompt processing speeds.
- Better multi-GPU performance: Ollama now schedules models more efficiently across multiple GPUs, significantly improving multi-GPU and mismatched-GPU performance.
- Accurate reporting: Measurements in tools like `nvidia-smi` now match `ollama ps`, making it easy to track memory utilization on your system (see the sketch after this list).
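
The same per-model memory figures shown by `ollama ps` are also available from the local API. Below is a minimal sketch that queries the `/api/ps` endpoint and prints how much of each loaded model is resident in VRAM, for comparison against `nvidia-smi`. It assumes an Ollama server on the default port (11434) and the `size` / `size_vram` fields returned by that endpoint.

```go
// Minimal sketch: query Ollama's /api/ps endpoint and print each loaded
// model's memory usage, for comparison against nvidia-smi.
// Assumes a local Ollama server on the default port (11434).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type runningModel struct {
	Name     string `json:"name"`
	Size     int64  `json:"size"`      // total memory used by the model, in bytes
	SizeVRAM int64  `json:"size_vram"` // portion resident on the GPU, in bytes
}

type psResponse struct {
	Models []runningModel `json:"models"`
}

func main() {
	resp, err := http.Get("http://localhost:11434/api/ps")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var ps psResponse
	if err := json.NewDecoder(resp.Body).Decode(&ps); err != nil {
		panic(err)
	}

	const gib = 1 << 30
	for _, m := range ps.Models {
		fmt.Printf("%s: %.1f GiB total, %.1f GiB on GPU\n",
			m.Name, float64(m.Size)/gib, float64(m.SizeVRAM)/gib)
	}
}
```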
All models implemented in Ollama’s new engine have this feature enabled by default, with more models coming soon as they transition to the new engine.
Examples
Long context
- GPU: 1x NVIDIA GeForce RTX 4090
- Model: `gemma3:12b`
- Context length: 128k
| Old | New |
|---|---|
| 52.02 tokens/s token generation speed | 85.54 tokens/s token generation speed |
| 19.9 GiB of VRAM | 21.4 GiB of VRAM |
| 48/49 layers loaded on GPU | 49/49 layers loaded on GPU |
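
To reproduce a long-context run like this, the context length can be requested through the `num_ctx` option. The sketch below sends a non-streaming request to the local `/api/generate` endpoint asking `gemma3:12b` for a 128k-token context window; the placeholder prompt and the default port are assumptions for illustration.

```go
// Minimal sketch of a long-context request against the local Ollama API,
// assuming the default port and that gemma3:12b is already pulled.
// The num_ctx option requests a 128k-token context window.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model":  "gemma3:12b",
		"prompt": "Summarize the following document: ...",
		"stream": false,
		"options": map[string]any{
			"num_ctx": 131072, // 128k context length
		},
	})

	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```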
Image input
- GPU: 2x NVIDIA GeForce RTX 4090
- Model: `mistral-small3.2`
- Context length: 32k
| Old | New |
|---|---|
| 127.84 tokens/s prompt evaluation speed | 1380.24 tokens/s prompt evaluation speed |
| 43.15 tokens/s token generation speed | 55.61 tokens/s token generation speed |
| 19.9 GiB of VRAM | 21.4 GiB of VRAM |
| 40/41 layers loaded on GPU | 41/41 layers loaded on GPU + vision model |
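
Image input works the same way over the API: images are passed as base64-encoded strings in the `images` field of a generate request. The sketch below is a rough example under those assumptions, using a hypothetical local file `photo.jpg` and the default port.

```go
// Rough sketch: send an image to mistral-small3.2 through the local Ollama API.
// Assumes the default port and a local file named photo.jpg (hypothetical).
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	img, err := os.ReadFile("photo.jpg")
	if err != nil {
		panic(err)
	}

	body, _ := json.Marshal(map[string]any{
		"model":  "mistral-small3.2",
		"prompt": "Describe this image.",
		"stream": false,
		"images": []string{base64.StdEncoding.EncodeToString(img)},
		"options": map[string]any{
			"num_ctx": 32768, // 32k context length, as in the example above
		},
	})

	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```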
Supported models
All models implemented in Ollama’s new engine use the new memory management features:
- `gpt-oss`
- `llama4`, `llama3.2-vision` (soon: `llama3.2`, `llama3.1`, `llama3`)
- `gemma3`, `embeddinggemma`, `gemma3n`
- `qwen3`, `qwen2.5vl` (soon: `qwen3-coder`)
- `mistral-small3.2`
- `all-minilm` and other embedding models