Qwen3.5 optimized for low VRAM

tools thinking 9b

ollama run reecdev/qwen3.5-lowvram:9b

curl http://localhost:11434/api/chat \
  -d '{
    "model": "reecdev/qwen3.5-lowvram:9b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

from ollama import chat

response = chat(
    model='reecdev/qwen3.5-lowvram:9b',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'reecdev/qwen3.5-lowvram:9b',
  messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)

Details

Updated 2 months ago

4cde4559a80a · 4.8GB

model (4.8GB)

arch: qwen35
parameters: 8.95B
quantization: Q3_K_L
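As a rough sanity check on the figures above, the file size and parameter count give an average storage cost per weight. This is a minimal sketch that assumes the rounded 4.8GB means 4.8 × 10^9 bytes and that the whole file is weight data; the function name is illustrative and not part of any tool.

```python
def average_bits_per_weight(file_bytes: float, num_parameters: float) -> float:
    """Average number of bits the file spends on each parameter."""
    if num_parameters <= 0:
        raise ValueError("num_parameters must be positive")
    return file_bytes * 8 / num_parameters


# Figures from the model details on this page (both are rounded values).
FILE_BYTES = 4.8e9       # assumption: "4.8GB" read as decimal gigabytes
NUM_PARAMETERS = 8.95e9  # "8.95B" parameters

if __name__ == "__main__":
    bpw = average_bits_per_weight(FILE_BYTES, NUM_PARAMETERS)
    print(f"{bpw:.2f} bits per weight on average")
```

The result is only an average: a file of this kind also stores metadata, and individual tensors may be kept at different precisions.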

params (65B)

{ "presence_penalty": 1.5, "temperature": 1, "top_k": 20, "top_p": 0.95 }
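The sampling values in the params block are the model's defaults. As a minimal sketch of how they could be overridden per request, the helper below builds a body for the /api/chat endpoint with an "options" object. The helper name and the override shown are illustrative assumptions; it only constructs the request and does not contact a server.

```python
import json

MODEL = "reecdev/qwen3.5-lowvram:9b"

# Sampling defaults copied from the params block on this page.
DEFAULT_OPTIONS = {
    "presence_penalty": 1.5,
    "temperature": 1,
    "top_k": 20,
    "top_p": 0.95,
}


def build_chat_request(prompt: str, **overrides) -> dict:
    """Return a JSON-serializable body for POST /api/chat.

    Keyword arguments replace individual sampling defaults.
    """
    options = {**DEFAULT_OPTIONS, **overrides}
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "options": options,
        "stream": False,  # ask for a single JSON reply instead of streamed chunks
    }


if __name__ == "__main__":
    # Hypothetical override: a lower temperature for one request.
    body = build_chat_request("Hello!", temperature=0.7)
    print(json.dumps(body, indent=2))
```

With the Python client, the same options dictionary can be passed as the options argument of chat().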

template (13B)

Readme

Qwen3.5-LowVRAM

Qwen3.5-LowVRAM is a version of Qwen3.5 9B optimized for GPUs with 6 GB of VRAM. It cuts VRAM usage by about 1.2 GB with near-zero quality loss.

Basic Usage

You can pull Qwen3.5-LowVRAM like this:

ollama pull reecdev/qwen3.5-lowvram:9b

Then run it:

ollama run reecdev/qwen3.5-lowvram:9b

Tested Hardware

Qwen3.5-LowVRAM was tested on an NVIDIA GeForce RTX 3050 6 GB with a range of tasks, including tool use and coding. It averaged 14 tokens per second, compared with 2 tokens per second for the regular Qwen3.5 9B running with model offloading, and completed these tasks successfully.
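To put the two throughput figures in practical terms, the sketch below estimates how long a reply of a given length would take at each rate. The reply length is a hypothetical value, and the estimate ignores model loading and prompt processing, so treat it as a rough comparison rather than a benchmark.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Estimated time to generate num_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Average rates reported above for the RTX 3050 6 GB test.
LOWVRAM_TPS = 14.0
REGULAR_OFFLOADED_TPS = 2.0

if __name__ == "__main__":
    reply_tokens = 500  # hypothetical reply length
    fast = generation_seconds(reply_tokens, LOWVRAM_TPS)
    slow = generation_seconds(reply_tokens, REGULAR_OFFLOADED_TPS)
    print(f"LowVRAM:            {fast:.1f} s")
    print(f"Regular, offloaded: {slow:.1f} s")
    print(f"Speedup:            {slow / fast:.1f}x")
```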

Notes

Qwen3.5-LowVRAM is only recommended for GPUs that cannot run the regular Qwen3.5-9B. If your GPU can run the regular Qwen3.5-9B, use that instead.