isotnek/qwen3.5:9B-Unsloth-UD-Q4_K_XL

505 pulls · 1 month ago

*TEXT-ONLY* Unsloth Quantization of Qwen3.5:9B

tools · thinking
ollama run isotnek/qwen3.5:9B-Unsloth-UD-Q4_K_XL

Details

ae3a55fdd011 · 6.0GB · qwen35 · 8.95B · Q4_K_M

Parameters: { "presence_penalty": 1.5, "temperature": 1, "top_k": 20, "top_p": 0.95 }

Template: {{ .Prompt }}

Readme

This is an Unsloth quantization of Qwen3.5-9B. For a full list of other quants, see the linked Unsloth HF repo. This specific quant was selected based on the analysis in this blog post, which found it to strike a good balance between performance preservation and model-size reduction.

This model is text-only because Ollama doesn’t yet support specifying mmproj files when creating Ollama Modelfiles with GGUFs. Still, this is a great model made better & faster by the good folks at Unsloth.

This model is set to reason by default. To disable reasoning in the Ollama CLI you can run:

ollama run isotnek/qwen3.5:9B-Unsloth-UD-Q4_K_XL

and then enter “/set nothink” at the interactive prompt.
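If you call the model through Ollama’s HTTP API instead of the CLI, recent Ollama versions also accept a `think` field on `/api/chat` that toggles reasoning per request. A minimal sketch of the request body (check that your Ollama version supports `think`; the prompt is just a placeholder):

```shell
# Sketch of an /api/chat body with reasoning disabled via "think": false.
# Assumes a recent Ollama version that supports the "think" field.
payload='{
  "model": "isotnek/qwen3.5:9B-Unsloth-UD-Q4_K_XL",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "think": false,
  "stream": false
}'
```

Send it to a running Ollama server with `curl http://localhost:11434/api/chat -d "$payload"` (11434 is Ollama’s default port).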

To run this (and other models in the linked Unsloth repo) for multimodal inference, I recommend instead using llama.cpp. To do so, download your desired model GGUF and mmproj-*.gguf files, and then:

brew install llama.cpp

# swap in your preferred model and mmproj files, if different;
# -ngl 99 offloads all layers to the GPU (Metal on Apple silicon)
llama-server \
  -m ./Qwen3.5-9B-UD-Q4_K_XL.gguf \
  --mmproj ./mmproj-BF16.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99

and to run inference:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,'$(base64 -i /Path/To/Your/Image.png)'"}}
      ]
    }]
  }'
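One caveat with the inline `$(base64 …)` substitution above: a large image can exceed the kernel’s argument-length limit (ARG_MAX), and the curl call will fail. A workaround sketch that assembles the body in a file instead (the image path and prompt are placeholders):

```shell
# Build the request body in a file to sidestep ARG_MAX on large images:
# shell variables and the printf builtin are not passed through exec,
# so they are not subject to the argument-length limit.
IMG=/Path/To/Your/Image.png
b64=$(base64 -i "$IMG" | tr -d '\n')  # tr strips line-wrap newlines, if any
printf '{
  "model": "qwen",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What is in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,%s"}}
    ]
  }]
}' "$b64" > payload.json
```

Then run the same curl command as above, replacing the inline `-d '{…}'` body with `--data @payload.json`.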