516 3 months ago

Qwen3:235b Instruct 2507 - Unsloth Dynamic 2.0 (Q3 K XL)

tools thinking

693b1b10da78 · 104GB · qwen3moe · 235B · Q3_K_M

Readme

Merged GGUF files from Unsloth’s Q3_K_XL quantization, packaged with the default Qwen3:235b Modelfile. It uses the recommended settings from Qwen, slightly updated with the temperature and top_p defaults recommended by Unsloth (listed close to the end of Unsloth's page).

Find the thinking model here.

Performance on an Apple M4 Max 128GB

Benchmarks were run on an Apple Mac Studio M4 Max with 128GB (16 CPU cores, 40 GPU cores) while doing basic home office work in parallel.

This model took ~50 seconds to load the first time after downloading. After a restart it takes around 15 seconds to load. Once it has gone inactive after a while, it takes between 6 and 10 seconds to wake up and accept input in ollama run.

Test #1: Short prompt (significant answer length)

Prompt:

Explain quantum computing like I am 10. Build 10 topics and explain them with 5 sentences each.

Context Length: 2048

VRAM: 108 GB

total duration:       57.875090667s
load duration:        30.372542ms
prompt eval count:    35 token(s)
prompt eval duration: 552.349291ms
prompt eval rate:     63.37 tokens/s
eval count:           1032 token(s)
eval duration:        57.291978417s
eval rate:            18.01 tokens/s

Context Length: 4096

VRAM: 111 GB

total duration:       54.706218375s
load duration:        30.177458ms
prompt eval count:    35 token(s)
prompt eval duration: 800.697375ms
prompt eval rate:     43.71 tokens/s
eval count:           976 token(s)
eval duration:        53.874746708s
eval rate:            18.12 tokens/s

Context Length: 8192

VRAM: 117 GB

total duration:       48.719248791s
load duration:        32.634958ms
prompt eval count:    35 token(s)
prompt eval duration: 800.158375ms
prompt eval rate:     43.74 tokens/s
eval count:           877 token(s)
eval duration:        47.885500208s
eval rate:            18.31 tokens/s

Test #2: Medium prompt

Prompt:

Summarize this into 5 topics with 5 sentences each: this blog post with 2009 tokens

Context Length: 2048

VRAM: 108 GB

total duration:       1m9.037277875s
load duration:        32.804167ms
prompt eval count:    2048 token(s)
prompt eval duration: 12.508220667s
prompt eval rate:     163.73 tokens/s
eval count:           757 token(s)
eval duration:        56.495572083s
eval rate:            13.40 tokens/s

Context Length: 4096

VRAM: 111 GB

total duration:       1m12.05915525s
load duration:        32.846292ms
prompt eval count:    2048 token(s)
prompt eval duration: 12.573978083s
prompt eval rate:     162.88 tokens/s
eval count:           736 token(s)
eval duration:        59.451638709s
eval rate:            12.38 tokens/s

Context Length: 8192

VRAM: 117 GB

total duration:       1m10.380066708s
load duration:        33.03275ms
prompt eval count:    2048 token(s)
prompt eval duration: 12.561087333s
prompt eval rate:     163.04 tokens/s
eval count:           721 token(s)
eval duration:        57.7849875s
eval rate:            12.48 tokens/s

Test #3: Long prompt

Prompt:

Summarize this into 5 topics with 5 sentences each: this blog post with 4683 tokens

Context Length: 8192

VRAM: 117 GB

total duration:       2m3.843154916s
load duration:        30.491875ms
prompt eval count:    4787 token(s)
prompt eval duration: 32.470091667s
prompt eval rate:     147.43 tokens/s
eval count:           772 token(s)
eval duration:        1m31.341756s
eval rate:            8.45 tokens/s
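
The rates reported in the tests above can be cross-checked from the printed counts and durations, since rate = token count / duration. The following is a minimal sketch, assuming a POSIX sh with awk. The function names are hypothetical, and the duration parser only covers the forms that appear in the output above (ms, s, and minutes plus seconds such as 1m31.341756s).

```shell
#!/bin/sh
# Convert a duration as printed in the stats above
# (e.g. "552.349291ms", "57.291978417s", "1m31.341756s") to seconds.
# Assumption: hours and microsecond units do not occur and are not handled.
to_seconds() {
  printf '%s\n' "$1" | awk '{
    d = $0; total = 0
    # Leading minutes part such as "1m"; the digit after "m" keeps "12ms" from matching.
    if (match(d, /^[0-9]+m[0-9]/)) {
      total += substr(d, 1, RLENGTH - 2) * 60
      d = substr(d, RLENGTH)
    }
    if (d ~ /ms$/)     total += substr(d, 1, length(d) - 2) / 1000
    else if (d ~ /s$/) total += substr(d, 1, length(d) - 1)
    printf "%.9f\n", total
  }'
}

# tokens_per_second COUNT DURATION, printed with two decimals like the stats above.
tokens_per_second() {
  awk -v n="$1" -v s="$(to_seconds "$2")" 'BEGIN { printf "%.2f\n", n / s }'
}

# Test #3 eval figures: 772 tokens in 1m31.341756s
tokens_per_second 772 1m31.341756s
```

For Test #3 this gives 772 / 91.341756 s, which agrees with the reported eval rate of 8.45 tokens/s.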

Merge process

Note to self on how I merged, updated, and pushed this model to the Ollama library:

  • download ggufs
  • merge ggufs (pointing to the first file)
    • ./llama-gguf-split --merge downloaded-model-00001-of-00003.gguf newmodel.gguf
  • get the original modelfile by downloading the Ollama model and using ollama show
    • ollama show modelname --modelfile > original-modelfile.txt
  • modify the model name, use the gguf file path as FROM …
  • create the new Ollama model
    • ollama create mymodel --file new-modelfile.txt
  • double check the name (copy to a new name if required)
    • ollama cp mymodel username/mymodel
  • push to the Ollama registry
    • ollama push username/mymodel
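
The Modelfile edit in the steps above (using the GGUF file path as FROM) can be scripted rather than done by hand. This is a minimal sketch, assuming the Modelfile exported by ollama show has one uncommented FROM line and that only this line needs to change. The function name set_modelfile_from is hypothetical, and the file names are the placeholders from the list.

```shell
#!/bin/sh
# Rewrite the active FROM line of a Modelfile read on stdin so that it points
# at a local GGUF file. Commented lines (e.g. "# FROM ...") and everything
# else (TEMPLATE, PARAMETER, LICENSE) pass through unchanged.
# Assumption: the path contains no backslashes, which awk -v would interpret.
set_modelfile_from() {
  awk -v path="$1" '
    /^FROM / { print "FROM " path; next }   # replace the uncommented FROM line
    { print }                               # keep every other line as-is
  '
}

# Usage with the placeholder file names from the steps above:
#   set_modelfile_from ./newmodel.gguf < original-modelfile.txt > new-modelfile.txt
```

The result is worth checking by hand before running ollama create, since any further changes (such as the model name) are not covered by this sketch.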