ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU
Model: oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU
A lightweight, FIM (Fill-In-the-Middle) optimized variant of Qwen2.5-Coder-0.5B-Instruct using the fp16 GGUF file from HuggingFace. At only ~1 GB, it fits comfortably on any 8 GB single GPU with headroom for 8K context, which suits real-time code completions, inline suggestions, and lightweight agentic coding tasks.
The ideal lightweight coding companion for: RTX 4060 · RTX 5060 · RTX 3060 · GTX 1660 · Intel Arc · Apple Silicon M-series · any GPU with 2 GB+ VRAM
`<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>` support

| Property | Value |
|---|---|
| Architecture | Dense decoder-only transformer |
| Total Parameters | 0.5B (494M) — all active per token |
| Layers | 24 |
| Hidden Dim | 896 |
| Attention Heads | 14 query heads, 2 key/value heads (grouped-query attention) |
| Native Context | 32,768 tokens |
| Configured Context | 8,192 tokens (FIM-optimized) |
| Modalities | Text only |
| Chat Format | ChatML (`<\|im_start\|>`, `<\|im_end\|>`) |
| FIM Tokens | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>` |
| Quantization | fp16 (maximum precision) |
| Model Size | ~1 GB |
| License | Apache 2.0 |
| Upstream | Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF |

| Resource | Minimum | Recommended |
|---|---|---|
| GPU Memory | 0.5 GB VRAM | 2 GB+ VRAM |
| System RAM | 2 GB | 4 GB |
| Disk Space | 2 GB free | 5 GB+ free |
| Ollama Version | 0.30.6+ | Latest |
Platform support:

- Any GPU (NVIDIA, AMD, Intel Arc, Apple Silicon) with 0.5 GB+ VRAM
- CPU-only: runs acceptably on modern CPUs (0.5B is tiny)
- Integrated GPU: Intel UHD, AMD Radeon Graphics (iGPU) all work
- Raspberry Pi: possible with smaller quantized variants
💡 Why 8 GB GPU? The model is only ~1 GB. An 8 GB card leaves about 7 GB of headroom, so you can run this alongside your browser, IDE, and other GPU workloads without breaking a sweat.
# macOS (Homebrew)
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
# Pull the model (downloads ~1 GB)
ollama pull oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU
# Run interactively
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU
# Single code prompt
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU \
"Write a Python function that merges two sorted lists"
# Interactive chat
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU
# With system prompt (start a session, then set it with /set system)
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU
>>> /set system "You are a helpful coding assistant. Provide concise code examples."
# FIM-style prompt (inline); bash $'...' quoting turns \n into real newlines
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU \
$'Complete this function:\ndef fibonacci(n):\n '
# Chat completion
curl -s http://127.0.0.1:11434/api/chat \
-d '{
"model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
"messages": [{"role": "user", "content": "Write a binary search in Python"}]
}'
# Generate (completion)
curl -s http://127.0.0.1:11434/api/generate \
-d '{
"model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
"prompt": "Explain what FIM means in code completion",
"stream": false
}'
# FIM Completion (Fill-in-the-Middle); "raw": true sends the prompt as-is, without the chat template
curl -s http://127.0.0.1:11434/api/generate \
-d '{
"model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
"prompt": "<|fim_prefix|>def hello():\n <|fim_suffix|>\n return False<|fim_middle|>",
"raw": true,
"stream": false,
"options": {
"temperature": 0.5,
"top_p": 0.8,
"repeat_penalty": 1.15
}
}'
# OpenAI-compatible endpoint
curl -s http://127.0.0.1:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
"messages": [{"role": "user", "content": "Write a React component"}],
"stream": false
}'
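The same OpenAI-compatible call can be sketched in Python using only the standard library. The helper names (`build_chat_request`, `extract_reply`, `ask`) are illustrative and not part of any Ollama package; the response layout assumed here is the usual OpenAI chat completion shape (`choices[0].message.content`).

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:11434/v1"  # same local endpoint as the curl call
MODEL = "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU"


def build_chat_request(user_prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(payload: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion payload."""
    return payload["choices"][0]["message"]["content"]


def ask(user_prompt: str) -> str:
    """Send the request; needs a running Ollama server on the default port."""
    with urllib.request.urlopen(build_chat_request(user_prompt)) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(ask("Write a React component"))` should print the model's reply.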
| Token | Purpose |
|---|---|
| `<\|fim_prefix\|>` | Marks the beginning of the code before the hole |
| `<\|fim_suffix\|>` | Marks the code after the hole |
| `<\|fim_middle\|>` | Marks where the model should fill in |
| `<\|repo_name\|>` | Optional: repository/file context |
| `<\|file_sep\|>` | Optional: separator between files |
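As a rough illustration of how the three core tokens fit together, the sketch below splits a source buffer at a cursor position and assembles the prompt in prefix, suffix, middle order, matching the prompts used in the API examples. The function names are hypothetical helpers, not part of Ollama or Qwen tooling.

```python
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split a buffer into the text before and after the cursor offset."""
    return source[:cursor], source[cursor:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code around the hole in FIM tokens; the model continues after <|fim_middle|>."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Example: the cursor sits on the empty, indented line inside hello().
source = "def hello():\n    \n    return False"
prefix, suffix = split_at_cursor(source, 17)
prompt = build_fim_prompt(prefix, suffix)
```

An editor integration would typically take `cursor` from the caret position and trim `prefix` and `suffix` so the whole prompt stays within the configured 8,192-token context.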
pip install ollama
import ollama
# Chat
response = ollama.chat(
model='oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU',
messages=[{'role': 'user', 'content': 'Write a JavaScript debounce function'}],
)
print(response.message.content)
# FIM Completion
response = ollama.generate(
model='oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU',
prompt='<|fim_prefix|>def hello():\n <|fim_suffix|>\n return False<|fim_middle|>',
raw=True,  # send the prompt as-is, without the chat template
options={
'temperature': 0.5,
'top_p': 0.8,
'repeat_penalty': 1.15,
},
)
print(response.response)
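A small wrapper can stitch the generated middle back between the prefix and suffix. This is a minimal sketch: `fill_in_middle` is a hypothetical helper, and it takes the generate function as an argument so it can be tried without a server. It assumes `raw=True` is needed so the FIM tokens reach the model without the chat template, and that the result can be indexed with `["response"]`.

```python
from typing import Any, Callable

MODEL = "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU"
FIM_OPTIONS = {"temperature": 0.5, "top_p": 0.8, "repeat_penalty": 1.15}


def fill_in_middle(prefix: str, suffix: str, generate: Callable[..., Any]) -> str:
    """Ask the model for the missing middle and return the completed source text."""
    prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
    # raw=True is an assumption here: it skips the chat template for this prompt.
    result = generate(model=MODEL, prompt=prompt, raw=True, options=FIM_OPTIONS)
    return prefix + result["response"] + suffix
```

With the `ollama` package installed and the server running, `fill_in_middle("def hello():\n    ", "\n    return False", ollama.generate)` should return the function with the hole filled in.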
npm install ollama
import ollama from 'ollama'
// Chat
const response = await ollama.chat({
model: 'oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU',
messages: [{ role: 'user', content: 'Write a CSS animation' }],
})
console.log(response.message.content)
These are baked into the model via its Modelfile:

| Parameter | Value | FIM Rationale |
|---|---|---|
| `num_ctx` | 8192 | Enough context for function bodies and class definitions |
| `num_gpu` | 99 | Offload all 24 layers to GPU |
| `temperature` | 0.5 | Balances coherence vs. diversity in code completions |
| `top_p` | 0.8 | Focused nucleus sampling for correct code |
| `top_k` | 40 | Reasonable token variety for code generation |
| `min_p` | 0.0 | Disabled; top_p already controls the nucleus |
| `repeat_penalty` | 1.15 | Prevents FIM loops and repetitive code blocks |
| `stop` | `<\|im_start\|>`, `<\|im_end\|>` | Chat template tokens |
| `stop` | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>` | FIM boundary tokens |
./llama-cli \
-m qwen2.5-coder-0.5b-instruct-fp16.gguf \
-ngl 99 \
-c 8192 \
--temp 0.5 \
--top-p 0.8 \
--repeat-penalty 1.15 \
-cnv
| Component | Size |
|---|---|
| Model weights (fp16) | ~1 GB |
| KV cache (fp16, 8K context) | ~0.1 GB |
| Ollama process overhead | ~0.1 GB |
| Total | ~1.2 GB ✅ plenty of headroom |
On any GPU with 2 GB+ VRAM:
| Context | KV Cache | Fits 2 GB? | Notes |
|---|---|---|---|
| 8,192 | ~0.1 GB | ✅ Lots of headroom | ~1.2 GB total |
| 16,384 | ~0.2 GB | ✅ Plenty | ~1.3 GB total |
| 32,768 (native) | ~0.4 GB | ✅ Native max | ~1.5 GB total |
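The KV cache figures can be reproduced with a back-of-the-envelope calculation. The sketch below assumes grouped-query attention with 2 key/value heads of dimension 64 (values taken from the upstream Qwen2.5 0.5B configuration, not stated in these tables), fp16 cache entries, and the ~1 GB weights plus ~0.1 GB overhead listed earlier. It mixes GiB and GB loosely, so treat the results as approximate.

```python
LAYERS = 24          # from the property table
KV_HEADS = 2         # assumption: upstream config, grouped-query attention
HEAD_DIM = 64        # assumption: upstream config
BYTES_PER_VALUE = 2  # fp16 cache entries
WEIGHTS_GB = 1.0     # approximate, from the memory table
OVERHEAD_GB = 0.1    # approximate, from the memory table


def kv_cache_bytes(context_tokens: int) -> int:
    """Keys and values (factor 2) for every layer, KV head, and token."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * context_tokens


def kv_cache_gib(context_tokens: int) -> float:
    return kv_cache_bytes(context_tokens) / 2**30


def total_gb(context_tokens: int) -> float:
    """Rough total: weights + process overhead + KV cache."""
    return WEIGHTS_GB + OVERHEAD_GB + kv_cache_gib(context_tokens)


for ctx in (8192, 16384, 32768):
    print(f"{ctx:>6} tokens: KV ~{kv_cache_gib(ctx):.1f} GB, total ~{total_gb(ctx):.1f} GB")
```

Under these assumptions each token costs 12 KiB of cache, which is why even the native 32,768-token context stays under half a gigabyte.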

| Hardware | Prompt Processing | Text Generation |
|---|---|---|
| RTX 4060 (8 GB) | ~2,000-5,000 tok/s | ~200-500 tok/s |
| GTX 1660 (6 GB) | ~800-2,000 tok/s | ~100-300 tok/s |
| Intel Arc A770 | ~1,000-3,000 tok/s | ~150-400 tok/s |
| Apple Silicon M1 | ~500-1,500 tok/s | ~80-200 tok/s |
| CPU-only (modern) | ~50-200 tok/s | ~20-80 tok/s |
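To translate these throughput ranges into a feel for completion latency, a simple model is prompt time plus generation time. The sketch below is only arithmetic over the table's approximate figures; the function name and the example token counts (a 1,000-token prompt and a 50-token completion) are illustrative assumptions.

```python
def completion_latency_s(
    prompt_tokens: int,
    generated_tokens: int,
    prompt_tps: float,
    generation_tps: float,
) -> float:
    """Estimate seconds per request: prompt processing plus token generation."""
    if prompt_tps <= 0 or generation_tps <= 0:
        raise ValueError("throughput values must be positive")
    return prompt_tokens / prompt_tps + generated_tokens / generation_tps


# Low end of the RTX 4060 row: ~2,000 tok/s prompt, ~200 tok/s generation.
gpu_estimate = completion_latency_s(1000, 50, 2000, 200)

# Low end of the CPU-only row: ~50 tok/s prompt, ~20 tok/s generation.
cpu_estimate = completion_latency_s(1000, 50, 50, 20)
```

Real latency also depends on model load time and prompt caching, which this estimate ignores.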
Add to ~/.config/opencode/opencode.jsonc:
"oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU": {
"name": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
"options": {
"supportsThinking": false,
"contextWindow": 8192
}
}
Use as agent:
"model": "ollama/oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU"
Or launch directly:
ollama launch opencode --model oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU
| Symptom | Fix |
|---|---|
| Model not found | Run ollama pull oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU first |
| Slow on CPU | Normal for CPU; enable any GPU with num_gpu 99 |
| FIM not working | Use /api/generate with "raw": true (not the chat endpoint) and put the FIM tokens in the prompt |
| Repetitive output | Override repeat_penalty to 1.2 or 1.25 at runtime |
| Nonsensical code | Lower temperature to 0.3 or 0.4 at runtime |
| OOM error | Reduce num_ctx — this model is so small OOM is nearly impossible |
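The runtime overrides mentioned in the table go in the `options` field of a request. As a minimal sketch, the hypothetical helper below builds an `/api/generate` payload that overrides only the named parameters and leaves the rest at their Modelfile defaults; the allowed-name check is an illustrative safeguard, not an Ollama feature.

```python
MODEL = "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU"

# Parameter names from the Modelfile table that make sense to override per request.
TUNABLE = {"num_ctx", "temperature", "top_p", "top_k", "min_p", "repeat_penalty"}


def generate_payload(prompt: str, **overrides: float) -> dict:
    """Build an /api/generate body whose options override only the given parameters."""
    unknown = set(overrides) - TUNABLE
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    return {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": dict(overrides),
    }


# Example: tame repetitive or nonsensical output for a single request.
payload = generate_payload(
    "Write a binary search in Python",
    repeat_penalty=1.2,
    temperature=0.3,
)
```

The resulting dictionary can be serialized with `json.dumps` and posted to `http://127.0.0.1:11434/api/generate`, or its `options` passed to `ollama.generate(..., options=...)`.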
# Larger coding model (needs 16 GB)
ollama pull oamazonasgabriel/qwen3.5-9b:q4-16gbGPU
# Larger MoE model (needs 24 GB)
ollama pull oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU
# Smaller quantized version
ollama pull qwen2.5-coder:0.5b
| Role | Entity |
|---|---|
| Base Model | Qwen Team, Alibaba Group |
| Original Model | Qwen2.5-Coder-0.5B-Instruct |
| GGUF Conversion | Qwen2.5-Coder-0.5B-Instruct-GGUF |
| Ollama Packaging | impacte.tech |
| License | Apache 2.0 |

| Resource | URL |
|---|---|
| This model on Ollama | https://ollama.com/oamazonasgabriel/qwen2.5-coder-0.5b |
| Upstream model (HuggingFace) | https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct |
| GGUF files (HuggingFace) | https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF |
| Qwen2.5-Coder paper | https://arxiv.org/abs/2409.12186 |
| Ollama docs | https://docs.ollama.com |
| Built by | impacte.tech |