I started from the fp16 weights of Qwen/Qwen3-4B-Instruct-2507 and quantised them to Q4_0.

tools

CLI

ollama run zendar79/qwen3:4b-q4_0

cURL

curl http://localhost:11434/api/chat \
  -d '{
    "model": "zendar79/qwen3:4b-q4_0",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Python

from ollama import chat

response = chat(
    model='zendar79/qwen3:4b-q4_0',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)

JavaScript

import ollama from 'ollama'

const response = await ollama.chat({
  model: 'zendar79/qwen3:4b-q4_0',
  messages: [{role: 'user', content: 'Hello!'}],
})
console.log(response.message.content)

Details

Updated 9 months ago

ad253fce0c56 · 2.4GB

model

arch: qwen3

parameters: 4.02B

quantization: Q4_0

2.4GB

template

{{ if .Messages }} {{- if or .System .Tools }}<|im_start|>system {{ .System }} {{- if .Tools }} # To

1.4kB

system

You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

68B

params

{ "repeat_penalty": 1.05, "temperature": 0.7, "top_k": 20, "top_p": 0.8 }

65B
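
The sampling parameters above are stored with the model, but they can also be sent per request. A minimal sketch, assuming the `options` field of Ollama's /api/chat request body; `build_chat_payload` is a hypothetical helper of mine, not part of any library:

```python
import json

# Sampling defaults copied from the params blob above.
DEFAULT_OPTIONS = {
    "repeat_penalty": 1.05,
    "temperature": 0.7,
    "top_k": 20,
    "top_p": 0.8,
}


def build_chat_payload(prompt, model="zendar79/qwen3:4b-q4_0", **overrides):
    """Build a JSON-serialisable body for POST /api/chat (hypothetical helper)."""
    # Per-call overrides win over the defaults shipped with the model.
    options = {**DEFAULT_OPTIONS, **overrides}
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": options,
        "stream": False,  # ask for a single JSON object instead of a stream
    }


if __name__ == "__main__":
    print(json.dumps(build_chat_payload("Hello!", temperature=0.2), indent=2))
```

The resulting dictionary can be posted to http://localhost:11434/api/chat, or its `options` entry can be passed as the `options` argument of `chat()` in the Python client.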

Readme

These are the steps I followed:

Download the model with the Hugging Face CLI

pip install -U huggingface_hub
huggingface-cli download Qwen/Qwen3-4B-Instruct-2507 \
        --local-dir ./Qwen3-4B-Instruct-2507 \
        --exclude "*.git*" "README.md" ".gitattributes"

Produce an f16 (unquantised) GGUF, running the script from the llama.cpp checkout set up in the next step

python convert_hf_to_gguf.py ./Qwen3-4B-Instruct-2507 \
        --outfile ./qwen3-4b-f16.gguf \
        --outtype f16
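
A quick sanity check on the converted file is to read its header. This is a sketch under the assumption that the file uses the GGUF version 2 or 3 layout (4-byte magic `GGUF`, then a little-endian uint32 version and two uint64 counts); `read_gguf_header` is a hypothetical name, not a llama.cpp function:

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        head = f.read(24)
    # A valid file starts with the ASCII magic "GGUF".
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", head[4:])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("./qwen3-4b-f16.gguf")` should return without raising if the conversion produced a well-formed file.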

Get the official llama.cpp repo

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt

cmake -B build
cmake --build build --config Release

Alternatively, skip the build and download the prebuilt llama.cpp release for your platform to get the tools.

Do quantisation (with the CMake build above, llama-quantize ends up in build/bin/)

./llama-quantize ./qwen3-4b-f16.gguf ./qwen3-4b-q4_k_m.gguf q4_k_m

./llama-quantize ./qwen3-4b-f16.gguf ./qwen3-4b-q4_0.gguf q4_0
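
To get a feel for what to expect from the q4_0 output, here is a rough size estimate. It assumes the standard Q4_0 block layout (32 weights per block, one fp16 scale plus 32 four-bit values) and pretends every tensor is stored that way; the function names are hypothetical:

```python
# Q4_0 stores weights in blocks of 32: one fp16 scale (2 bytes)
# plus 32 four-bit values (16 bytes), i.e. 18 bytes per block.
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 2 + BLOCK_WEIGHTS // 2


def q4_0_bits_per_weight():
    """Effective bits per weight, including the per-block scale."""
    return BLOCK_BYTES * 8 / BLOCK_WEIGHTS


def estimate_size_gb(n_params, bits_per_weight):
    """Rough size in GB (10^9 bytes), ignoring metadata and mixed tensor types."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 4.02e9  # parameter count from the Details section
    print(f"f16 : {estimate_size_gb(n_params, 16):.2f} GB")
    print(f"q4_0: {estimate_size_gb(n_params, q4_0_bits_per_weight()):.2f} GB")
```

This gives about 8.04 GB for f16 and 2.26 GB for Q4_0 at 4.5 bits per weight. The published blob is 2.4GB; my guess is that the gap comes from metadata and from a few tensors being kept at higher precision, but I have not verified that.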

Use with Ollama

See this page for the template format and how to import the model into Ollama.
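
For the import itself, a Modelfile can be generated from the values listed under Details. This is only a sketch: `render_modelfile` is a hypothetical helper, and the TEMPLATE directive is left out because the template shown above is truncated and would have to be taken from the linked page.

```python
# Values copied from the params and system blobs in the Details section.
PARAMS = {"repeat_penalty": 1.05, "temperature": 0.7, "top_k": 20, "top_p": 0.8}
SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_modelfile(gguf_path, params=PARAMS, system=SYSTEM):
    """Render Modelfile text; the TEMPLATE directive is deliberately left out."""
    lines = [f"FROM {gguf_path}"]
    # One PARAMETER line per sampling setting.
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_modelfile("./qwen3-4b-q4_0.gguf"), end="")
```

Saving the output as `Modelfile` and running `ollama create <name> -f Modelfile` (with a name of your choice) would then register the quantised GGUF with Ollama.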