Ollama's highest performance on Apple Silicon yet with MLX

June 11, 2026

Ollama’s MLX engine has been updated to deliver its highest performance on Apple Silicon yet. By leaning more heavily on Apple’s unified memory and the Metal-backed MLX framework, models produce higher-quality responses, respond faster, and use less memory.

A coding agent with Gemma 4 12B on a MacBook Pro M5 Max. Ollama's improved MLX engine provides higher-quality results, higher output speeds and faster time to first token with thinking and multiple sub-agents.

Higher quality responses with NVFP4

Ollama’s MLX engine has been updated to support NVIDIA’s model-optimized NVFP4 format, allowing for higher-quality outputs than other 4-bit quantization formats while maintaining state-of-the-art performance. As an added benefit, models that are optimized for datacenter deployment can now be imported and run with Ollama’s MLX engine, making them portable between the datacenter and the desktop.

NVFP4 tracks the local dynamic range of model weights more closely, reducing loss from quantization. Comparing perplexity for the Gemma 4 12B model across q4_K_M (a common 4-bit quantization format available with Ollama), NVFP4, and unquantized BF16 weights shows that model-optimized NVFP4 roughly halves the quality loss while maintaining performance:

Perplexity, Gemma 4 12B (lower is better): NVFP4 roughly halves the quality loss of 4-bit quantization, relative to unquantized BF16.
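
To illustrate why tracking local dynamic range reduces quantization loss, here is a minimal Python sketch of block-scaled 4-bit quantization. It assumes the publicly described NVFP4 layout (4-bit E2M1 values in blocks of 16 elements that share a scale) but simplifies it: scales stay ordinary Python floats instead of FP8, there is no second-level per-tensor scale, and the weights are made-up numbers. It is not Ollama's or MLX's implementation.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block with a shared scale and return the dequantized values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / FP4_GRID[-1]  # map the block's largest magnitude onto 6.0
    out = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable FP4 value.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out

def quantize(weights, block_size):
    """Quantize a flat list of weights in independent blocks."""
    out = []
    for i in range(0, len(weights), block_size):
        out.extend(quantize_block(weights[i:i + block_size]))
    return out

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Illustrative weights (assumption): 16 small values followed by 16 large ones.
small = [0.01 * (i - 8) for i in range(16)]
large = [1.0 * (i - 8) for i in range(16)]
weights = small + large

per_block = quantize(weights, 16)            # one scale per 16 values
one_scale = quantize(weights, len(weights))  # one scale for everything

err_block = mean_squared_error(weights, per_block)
err_tensor = mean_squared_error(weights, one_scale)
print(f"per-block error: {err_block:.6f}, single-scale error: {err_tensor:.6f}")
```

With a single scale, the large values set the range and every small value rounds to zero. With one scale per block, the small block gets its own range, so its values survive quantization while the large block is unchanged.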

Faster output performance

Ollama’s MLX engine is now up to 20% faster thanks to new optimizations: several operations are now fused into single Metal kernels using MLX’s just-in-time compilation, and we’ve reworked Ollama’s GPU-backed sampling to run more efficiently.

Output speed in tokens/s (higher is better): NVFP4 generates about 20% faster than q4_K_M on the updated engine. Average output speed over 10 runs with an 8,300-token input prompt.
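
For readers who want to take a similar measurement, output speed can be derived from the timing fields Ollama's REST API returns with a completed generation: `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below averages that figure across runs. The helper names are hypothetical and the sample numbers are placeholders, not benchmark results.

```python
from statistics import mean

NANOSECONDS_PER_SECOND = 1_000_000_000

def output_tokens_per_second(response):
    """Decode speed for one run, from an Ollama generate/chat response dict."""
    seconds = response["eval_duration"] / NANOSECONDS_PER_SECOND
    return response["eval_count"] / seconds

def average_output_speed(responses):
    """Mean decode speed across several runs of the same prompt."""
    return mean(output_tokens_per_second(r) for r in responses)

# Placeholder responses (assumption): only the two fields used here are shown.
runs = [
    {"eval_count": 256, "eval_duration": 4_000_000_000},
    {"eval_count": 256, "eval_duration": 2_000_000_000},
]

print(f"{average_output_speed(runs):.1f} tokens/s")
```

Prompt processing is reported separately (`prompt_eval_count` and `prompt_eval_duration`), so this number reflects generation only.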

More responsive with agent workflows

Agent workloads are dominated by prompt processing. Every tool call is a new request, and every request resends the whole transcript: system prompt, tool definitions, and every file read so far. Over a single task the model ends up processing the same context dozens of times. Prefix caching avoids the repeated work, as long as each request picks up where the last one left off.
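
The idea can be sketched as a longest-common-prefix check between the tokens already processed and the tokens in the next request. This is an illustrative toy, not Ollama's cache: `PrefixCache` and its method are hypothetical names, token IDs stand in for cached model state, and the cache is assumed to be truncatable to any prefix.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    """Toy cache that remembers the tokens processed for the previous request."""

    def __init__(self):
        self.tokens = []

    def process(self, request_tokens):
        """Return (tokens reused from the cache, tokens newly processed)."""
        reused = common_prefix_len(self.tokens, request_tokens)
        processed = len(request_tokens) - reused
        self.tokens = list(request_tokens)
        return reused, processed

cache = PrefixCache()
system = list(range(100))            # stand-in for system prompt + tool definitions
turn1 = system + [1001, 1002]
turn2 = turn1 + [2001, 2002, 2003]   # picks up where turn 1 left off

first = cache.process(turn1)    # (0, 102): nothing cached yet
second = cache.process(turn2)   # (102, 3): only the new tokens are processed
print(first, second)
```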

Real agent sessions don’t work that way for long. Ollama’s new snapshot system saves model state at key points across conversations, using the same approach that serves agent workloads in Ollama’s cloud.

Most new models make this harder than it sounds. Sliding-window attention and recurrent layers carry state that can’t be rewound. Once the model moves past a point in the conversation, that point can’t be recovered unless state was saved at the time. Ollama saves state at the points conversations are likely to return to: where they branch, at intervals through long prompts, and just before each response. Keeping snapshots selective and incremental leaves more memory for the model.
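
A minimal sketch of why saved snapshots matter when state cannot be rewound: a request can resume only from a position where state was saved, never from an arbitrary point in between. `SnapshotStore` is a hypothetical name, the policy here (a fixed interval plus the end of each prompt) is a simplification of the points listed above, and a token prefix stands in for the saved model state.

```python
class SnapshotStore:
    """Toy model of non-rewindable state: resume only from saved positions."""

    def __init__(self, interval):
        self.interval = interval
        self.snapshots = set()  # token prefixes for which state was saved

    def _resume_point(self, tokens):
        """Length of the longest saved prefix that matches this request."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.snapshots:
                return n
        return 0

    def process(self, tokens):
        """Return (resume position, tokens that had to be processed)."""
        start = self._resume_point(tokens)
        for n in range(start + 1, len(tokens) + 1):
            # Save at intervals through the prompt and just before the response.
            if n % self.interval == 0 or n == len(tokens):
                self.snapshots.add(tuple(tokens[:n]))
        return start, len(tokens) - start

store = SnapshotStore(interval=32)
system = list(range(100))            # stand-in for system prompt + tool definitions
turn1 = system + [1001, 1002]
turn2 = turn1 + [2001, 2002, 2003]   # continues turn 1
branch = system + [3001]             # diverges after the shared 100 tokens
turn3 = turn2 + [4001]               # returns to the first conversation

results = [store.process(t) for t in (turn1, turn2, branch, turn3)]
print(results)  # [(0, 102), (102, 3), (96, 5), (105, 1)]
```

The branch shares 100 tokens with earlier requests but resumes from the snapshot at position 96, the nearest saved point, and reprocesses 5 tokens rather than all 101. Returning to the first conversation afterwards still finds its snapshot at position 105.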

Get started

To run models with Ollama’s MLX engine, download the latest version of Ollama, then run a model:

ollama run gemma4:12b-mlx

For use in a coding agent, use ollama launch:

ollama launch pi --model gemma4:12b-mlx