Merged GGUF files (104GB) from Unsloth's Q3_K_XL quant, using the default Qwen3:235b modelfile with the settings recommended by Qwen, slightly adjusted to the temperature and top_p defaults recommended by Unsloth (near the end of that page).
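For reference, a minimal sketch of what those parameters look like in a modelfile, assuming the non-thinking defaults that Qwen and Unsloth recommend (temperature 0.7, top_p 0.8, top_k 20, min_p 0); check the linked pages for the authoritative values. TEMPLATE and the other fields carry over from the original modelfile and are omitted here:

FROM ./newmodel.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER top_k 20
PARAMETER min_p 0.0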
Find the thinking model here.
Benchmarks on an Apple Mac Studio M4 Max 128GB (16-core CPU / 40-core GPU) while doing basic home office work in parallel.
This model took ~50 seconds to load right after downloading. After a restart it takes around 15 seconds to load, and once it has gone inactive for a while, it takes between 6 and 10 seconds to wake up and accept input to ollama run.
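If the wake-up delay matters, the model can be kept resident; a minimal sketch using Ollama's keep-alive setting, where -1 means keep the model loaded indefinitely and the model name is a placeholder:

# server-wide: keep loaded models in memory indefinitely
OLLAMA_KEEP_ALIVE=-1 ollama serve
# or per request via the API (an empty generate request just loads the model)
curl http://localhost:11434/api/generate -d '{"model": "newmodel", "keep_alive": -1}'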
Prompt:
Explain quantum computing like I am 10. Build 10 topics and explain them with 5 sentences each.
VRAM: 108 GB
total duration: 57.875090667s
load duration: 30.372542ms
prompt eval count: 35 token(s)
prompt eval duration: 552.349291ms
prompt eval rate: 63.37 tokens/s
eval count: 1032 token(s)
eval duration: 57.291978417s
eval rate: 18.01 tokens/s
VRAM: 111 GB
total duration: 54.706218375s
load duration: 30.177458ms
prompt eval count: 35 token(s)
prompt eval duration: 800.697375ms
prompt eval rate: 43.71 tokens/s
eval count: 976 token(s)
eval duration: 53.874746708s
eval rate: 18.12 tokens/s
VRAM: 117 GB
total duration: 48.719248791s
load duration: 32.634958ms
prompt eval count: 35 token(s)
prompt eval duration: 800.158375ms
prompt eval rate: 43.74 tokens/s
eval count: 877 token(s)
eval duration: 47.885500208s
eval rate: 18.31 tokens/s
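For reference, stats like the ones above come from Ollama's --verbose flag; a sketch, with the model name as a placeholder:

ollama run newmodel --verbose "Explain quantum computing like I am 10. Build 10 topics and explain them with 5 sentences each."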
Prompt:
Summarize this into 5 topics with 5 sentences each: this blog post with 2009 tokens
VRAM: 108 GB
total duration: 1m9.037277875s
load duration: 32.804167ms
prompt eval count: 2048 token(s)
prompt eval duration: 12.508220667s
prompt eval rate: 163.73 tokens/s
eval count: 757 token(s)
eval duration: 56.495572083s
eval rate: 13.40 tokens/s
VRAM: 111 GB
total duration: 1m12.05915525s
load duration: 32.846292ms
prompt eval count: 2048 token(s)
prompt eval duration: 12.573978083s
prompt eval rate: 162.88 tokens/s
eval count: 736 token(s)
eval duration: 59.451638709s
eval rate: 12.38 tokens/s
VRAM: 117 GB
total duration: 1m10.380066708s
load duration: 33.03275ms
prompt eval count: 2048 token(s)
prompt eval duration: 12.561087333s
prompt eval rate: 163.04 tokens/s
eval count: 721 token(s)
eval duration: 57.7849875s
eval rate: 12.48 tokens/s
Prompt:
Summarize this into 5 topics with 5 sentences each: this blog post with 4683 tokens
VRAM: 117 GB
total duration: 2m3.843154916s
load duration: 30.491875ms
prompt eval count: 4787 token(s)
prompt eval duration: 32.470091667s
prompt eval rate: 147.43 tokens/s
eval count: 772 token(s)
eval duration: 1m31.341756s
eval rate: 8.45 tokens/s
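As a sanity check: total duration is roughly load + prompt eval + eval duration (0.03s + 32.47s + 91.34s ≈ 2m3.8s for this run), and each rate is just count divided by duration, e.g. 772 tokens / 91.34s ≈ 8.45 tokens/s.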
Note to self on how I merged, updated and pushed this model to the Ollama library:
./llama-gguf-split --merge downloaded-model-00001-of-00003.gguf newmodel.gguf
ollama show modelname --modelfile > original-modelfile.txt
Edit the saved modelfile into new-modelfile.txt, pointing FROM … at the merged newmodel.gguf and keeping the recommended parameters.
ollama create newmodel --file new-modelfile.txt
ollama cp newmodel username/newmodel
ollama push username/newmodel
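Before pushing, it is worth verifying the new model locally; a minimal sketch, with names as above:

ollama show newmodel --modelfile
ollama run newmodel --verbose "Hi"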