New model scheduling
September 23, 2025
Ollama now includes a significantly improved model scheduling system. Before running a model, Ollama’s new engine measures the exact amount of memory required, rather than relying on the estimates used in previous versions of Ollama. This has several benefits:
- Significantly reduced crashes due to out-of-memory issues: Because memory management is exact, over-allocations no longer occur, meaning fewer out-of-memory errors.
- Maximized GPU utilization: Ollama’s new memory management allocates more memory to the GPU, increasing token generation and prompt processing speeds.
- Better multi-GPU performance: Ollama now schedules models more efficiently across multiple GPUs, significantly improving multi-GPU and mismatched-GPU performance.
- Accurate reporting: Measurements in tools like `nvidia-smi` now match `ollama ps`, making it easy to track memory utilization on your system (see the sketch after this list).
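
The same per-model memory figures shown by `ollama ps` are also available from the local API. Below is a minimal sketch that queries the `/api/ps` endpoint and prints how much of each loaded model is resident in VRAM, for comparison against `nvidia-smi`. It assumes an Ollama server on the default port (11434) and the `size` / `size_vram` fields returned by that endpoint.

```go
// Minimal sketch: query Ollama's /api/ps endpoint and print each loaded
// model's memory usage, for comparison against nvidia-smi.
// Assumes a local Ollama server on the default port (11434).
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type runningModel struct {
	Name     string `json:"name"`
	Size     int64  `json:"size"`      // total memory used by the model, in bytes
	SizeVRAM int64  `json:"size_vram"` // portion resident on the GPU, in bytes
}

type psResponse struct {
	Models []runningModel `json:"models"`
}

func main() {
	resp, err := http.Get("http://localhost:11434/api/ps")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var ps psResponse
	if err := json.NewDecoder(resp.Body).Decode(&ps); err != nil {
		panic(err)
	}

	const gib = 1 << 30
	for _, m := range ps.Models {
		fmt.Printf("%s: %.1f GiB total, %.1f GiB on GPU\n",
			m.Name, float64(m.Size)/gib, float64(m.SizeVRAM)/gib)
	}
}
```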
All models implemented in Ollama’s new engine have this feature enabled by default, with more models coming soon as they transition to the new engine.
Examples
Long context
- GPU: 1x NVIDIA GeForce RTX 4090
- Model: `gemma3:12b`
- Context length: 128k
| Old | New |
|---|---|
| 52.02 tokens/s token generation speed | 85.54 tokens/s token generation speed |
| 19.9 GiB of VRAM | 21.4 GiB of VRAM |
| 48/49 layers loaded on GPU | 49/49 layers loaded on GPU |
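
To reproduce a long-context run like this, the context length can be requested through the `num_ctx` option. The sketch below sends a non-streaming request to the local `/api/generate` endpoint asking `gemma3:12b` for a 128k-token context window; the placeholder prompt and the default port are assumptions for illustration.

```go
// Minimal sketch of a long-context request against the local Ollama API,
// assuming the default port and that gemma3:12b is already pulled.
// The num_ctx option requests a 128k-token context window.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model":  "gemma3:12b",
		"prompt": "Summarize the following document: ...",
		"stream": false,
		"options": map[string]any{
			"num_ctx": 131072, // 128k context length
		},
	})

	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```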
Image input
- GPU: 2x NVIDIA GeForce RTX 4090
- Model: `mistral-small3.2`
- Context length: 32k
| Old | New |
|---|---|
| 127.84 tokens/s prompt evaluation speed | 1380.24 tokens/s prompt evaluation speed |
| 43.15 tokens/s token generation speed | 55.61 tokens/s token generation speed |
| 19.9 GiB of VRAM | 21.4 GiB of VRAM |
| 40/41 layers loaded on GPU | 41/41 layers loaded on GPU + vision model |
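
Image input works the same way over the API: images are passed as base64-encoded strings in the `images` field of a generate request. The sketch below is a rough example under those assumptions, using a hypothetical local file `photo.jpg` and the default port.

```go
// Rough sketch: send an image to mistral-small3.2 through the local Ollama API.
// Assumes the default port and a local file named photo.jpg (hypothetical).
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	img, err := os.ReadFile("photo.jpg")
	if err != nil {
		panic(err)
	}

	body, _ := json.Marshal(map[string]any{
		"model":  "mistral-small3.2",
		"prompt": "Describe this image.",
		"stream": false,
		"images": []string{base64.StdEncoding.EncodeToString(img)},
		"options": map[string]any{
			"num_ctx": 32768, // 32k context length, as in the example above
		},
	})

	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```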
Supported models
All models implemented in Ollama’s new engine use the new memory management features:
- `gpt-oss`
- `llama4`, `llama3.2-vision` (soon: `llama3.2`, `llama3.1`, `llama3`)
- `gemma3`, `embeddinggemma`, `gemma3n`
- `qwen3`, `qwen2.5vl` (soon: `qwen3-coder`)
- `mistral-small3.2`
- `all-minilm` and other embedding models