New model scheduling

September 23, 2025

[Illustration: Ollama waiting in line]

Ollama now includes a significantly improved model scheduling system. Before running a model, Ollama's new engine measures the exact amount of memory the model requires, rather than relying on the estimate used in previous versions of Ollama. This has several benefits, visible in the examples below:

- Fewer out-of-memory crashes, since models are scheduled against measured rather than estimated memory use
- Higher GPU utilization: more of the available VRAM can be safely used, so more layers are loaded onto the GPU
- Faster token generation and prompt evaluation as a result
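The idea behind the change can be sketched in a few lines of Go. This is a minimal illustration, not Ollama's actual implementation; `layersThatFit`, its parameters, and the numbers in `main` are all hypothetical:

```go
package main

import "fmt"

// layersThatFit is a hypothetical sketch: given the measured memory cost of
// each layer and the GPU's free VRAM, count how many layers can be offloaded.
// With exact measurements there is no need to reserve a large safety margin.
func layersThatFit(freeVRAM uint64, perLayer []uint64, overhead uint64) int {
	used := overhead // non-layer allocations (KV cache, compute buffers, ...)
	n := 0
	for _, size := range perLayer {
		if used+size > freeVRAM {
			break // the next layer no longer fits; it stays on the CPU
		}
		used += size
		n++
	}
	return n
}

func main() {
	// Illustrative numbers only: 49 layers of ~440 MiB each on a GPU with
	// ~21.5 GiB of free VRAM and ~1 GiB of fixed overhead.
	perLayer := make([]uint64, 49)
	for i := range perLayer {
		perLayer[i] = 440 << 20
	}
	fmt.Println(layersThatFit(21500<<20, perLayer, 1<<30))
}
```

When the scheduler works from an estimate, it has to leave a margin and may keep a layer or two on the CPU; working from measured costs, it can fill the GPU closer to capacity, which is why the examples below show more layers on the GPU and higher VRAM use.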

All models implemented in Ollama's new engine have this feature enabled by default, and more models will gain it as they transition to the new engine.

Examples

Long context

|                         | Old            | New            |
| ----------------------- | -------------- | -------------- |
| Token generation speed  | 52.02 tokens/s | 85.54 tokens/s |
| VRAM used               | 19.9 GiB       | 21.4 GiB       |
| Layers loaded on GPU    | 48/49          | 49/49          |

Image input

|                          | Old             | New                   |
| ------------------------ | --------------- | --------------------- |
| Prompt evaluation speed  | 127.84 tokens/s | 1380.24 tokens/s      |
| Token generation speed   | 43.15 tokens/s  | 55.61 tokens/s        |
| VRAM used                | 19.9 GiB        | 21.4 GiB              |
| Layers loaded on GPU     | 40/41           | 41/41 + vision model  |

Supported models

All models implemented in Ollama’s new engine use the new memory management features: