Ollama's highest performance on Apple Silicon yet with MLX
June 11, 2026
Ollama’s MLX engine has been updated to deliver its highest performance on Apple Silicon yet. By leaning more heavily on Apple’s unified memory and the Metal-backed MLX framework, models produce higher-quality responses, respond faster, and use less memory.
Higher quality responses with NVFP4
Ollama’s MLX engine has been updated to support NVIDIA’s model-optimized NVFP4 format, allowing for higher quality outputs than other 4-bit quantization formats while maintaining state-of-the-art performance. As an added benefit, models that are optimized for datacenter deployment can now be imported and run with Ollama’s MLX engine, allowing for portability between the datacenter and the desktop.
NVFP4 tracks the local dynamic range of model weights more closely, reducing loss from quantization. We measured perplexity for the Gemma 4 12B model in three forms: q4_K_M (a common 4-bit quantization format available with Ollama), NVFP4, and unquantized bf16 weights. Measured against the bf16 baseline, model-optimized NVFP4 roughly halves the quality loss of q4_K_M while maintaining performance.
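To make "tracking the local dynamic range" concrete, here is a minimal sketch of block-scaled 4-bit quantization in the spirit of NVFP4. It is not Ollama's or MLX's implementation. It assumes 16-element blocks and the E2M1 value grid, and it keeps each block scale as a plain Python float, whereas the real format stores scales in reduced precision. All function names are hypothetical.

```python
# Illustrative sketch only; not Ollama's or MLX's implementation.
# Assumption: values are 4-bit E2M1 floats, whose representable magnitudes are:
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed number of weights that share one scale


def quantize_block(block):
    """Return (scale, codes) for one block; codes are signed grid values."""
    max_abs = max(abs(w) for w in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / E2M1_GRID[-1]
    codes = []
    for w in block:
        target = abs(w) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if w >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from a scale and its grid codes."""
    return [scale * c for c in codes]


def quantize_dequantize(weights, block_size=BLOCK_SIZE):
    """Round-trip a flat list of weights through block-scaled quantization."""
    out = []
    for start in range(0, len(weights), block_size):
        scale, codes = quantize_block(weights[start:start + block_size])
        out.extend(dequantize_block(scale, codes))
    return out


def max_error(a, b):
    """Largest absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))
```

The smaller the group of weights sharing a scale, the less a single large weight can flatten its small neighbours to zero. Calling `quantize_dequantize` on the same weights with `block_size=16` and then with one scale for the whole tensor shows the difference in reconstruction error.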
Faster output performance
Ollama’s MLX engine is now up to 20% faster thanks to new optimizations: several operations are now fused into single Metal kernels via MLX’s just-in-time compiler features, and we’ve reworked Ollama’s GPU-backed sampling to run more efficiently.
Average output speed over 10 runs when provided an 8,300-token input prompt.
More responsive with agent workflows
Agent workloads are dominated by prompt processing. Every tool call is a new request, and every request resends the whole transcript: system prompt, tool definitions, and every file read so far. Over a single task the model ends up processing the same context dozens of times. Prefix caching avoids the repeated work, as long as each request picks up where the last one left off.
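As a rough illustration of the idea, prefix caching boils down to comparing the tokens the engine has already processed with the tokens of the new request, and only processing what comes after the shared prefix. The sketch below is a simplification with hypothetical function names; it operates on plain token lists rather than real model state.

```python
# Illustrative sketch only; function names are hypothetical.

def common_prefix_len(cached, request):
    """Number of leading tokens the cached sequence and the request share."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


def tokens_to_process(cached, request):
    """Tokens that still need prompt processing after reusing the shared prefix."""
    return request[common_prefix_len(cached, request):]


# A request that extends the cached conversation only processes the new tail.
cached = [1, 2, 3, 4, 5]
extended = cached + [6, 7]
new_work = tokens_to_process(cached, extended)  # the two appended tokens

# A request that diverges early has to reprocess everything after the split.
diverged = [1, 2, 9, 10, 11]
redo_work = tokens_to_process(cached, diverged)  # everything after token 2
```

The second case is the weak spot: the match is all-or-nothing from the front, so any change early in the transcript throws away the work done on everything after it.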
Real agent sessions don’t work that way for long. Ollama’s new snapshot system saves model state at key points across conversations, using the same approach that serves agent workloads in Ollama’s cloud:
Multiple agents: An agent hands off to a subagent and picks back up later, or two sessions run at the same time. Each one resumes from its own saved state, and anything they have in common — often tens of thousands of tokens of system prompt, tool definitions, and ingested files — is only processed once.
Thinking models: Reasoning tokens are generated, then dropped from the conversation history, so the next request never matches the state the engine just built. Each turn would normally reprocess the whole conversation. A snapshot taken right before the response starts gives the next turn somewhere to resume from.
Branching and retries: A different follow-up or a regenerated response diverges from the cached conversation instead of extending it. Because snapshots exist where conversations split, only the new direction needs to be processed.
Most new models make this harder than it sounds. Sliding-window attention and recurrent layers carry state that can’t be rewound. Once the model moves past a point in the conversation, that point can’t be recovered unless state was saved at the time. Ollama saves state at the points conversations are likely to return to: where they branch, at intervals through long prompts, and just before each response. Keeping snapshots selective and incremental leaves more memory for the model.
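One way to picture the snapshot approach: state is stored under the exact token prefix it was built from, and a new request resumes from the longest stored prefix it begins with. This is a toy model under stated assumptions, not Ollama's implementation. `SnapshotStore` and its methods are hypothetical, the saved state is an opaque placeholder, and a real engine would likely use a more efficient lookup than this linear scan.

```python
# Illustrative sketch only; class and method names are hypothetical.

class SnapshotStore:
    """Keeps opaque model state keyed by the token prefix it was built from."""

    def __init__(self):
        self._snapshots = {}  # tuple of tokens -> saved state

    def save(self, tokens, state):
        """Record state as it was after processing exactly these tokens."""
        self._snapshots[tuple(tokens)] = state

    def best_resume_point(self, request):
        """Return (prefix_len, state) for the longest saved prefix of request.

        Returns (0, None) when no snapshot matches, meaning the whole
        request has to be processed from scratch.
        """
        best_len, best_state = 0, None
        for prefix, state in self._snapshots.items():
            n = len(prefix)
            if best_len < n <= len(request) and tuple(request[:n]) == prefix:
                best_len, best_state = n, state
        return best_len, best_state


# Thinking-model case: the prompt is tokens [1, 2, 3]; a snapshot is saved
# just before the response starts.
store = SnapshotStore()
store.save([1, 2, 3], "state-before-response")

# The engine then generates reasoning tokens (90, 91) and an answer (4), so
# its live state reflects [1, 2, 3, 90, 91, 4]. The next request drops the
# reasoning tokens and adds a new user turn (5).
next_request = [1, 2, 3, 4, 5]
resume_len, state = store.best_resume_point(next_request)
remaining = next_request[resume_len:]  # only these tokens need processing
```

Because state from sliding-window or recurrent layers cannot be rewound, the live state after the answer is of no use for `next_request`. The snapshot taken before the response is the latest point the next turn can resume from.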
Get started
To run models with Ollama’s MLX engine, download the latest version of Ollama, then run a model:
ollama run gemma4:12b-mlx
For use in a coding agent, use ollama launch:
ollama launch pi --model gemma4:12b-mlx