Qwen2.5-Coder-0.5B — Ollama Model (8 GB) FIM-Optimized

Model: oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU

DESCRIPTION

A lightweight, FIM (Fill-In-the-Middle) optimized variant of Qwen2.5-Coder-0.5B-Instruct using the fp16 GGUF build from HuggingFace. At only ~1 GB, it fits comfortably on any 8 GB single GPU with headroom for 8K context, which suits real-time code completions, inline suggestions, and lightweight agentic coding tasks.

The ideal lightweight coding companion for: RTX 4060 · RTX 5060 · RTX 3060 · GTX 1660 · Intel Arc · Apple Silicon M-series · any GPU with 2 GB+ VRAM

Key Features

FIM-optimized: temperature 0.5, top_p 0.8, repeat_penalty 1.15 — tuned for precise fill-in-the-middle code completions
Tiny footprint: only ~1 GB — runs on virtually any GPU, including integrated GPUs
Unquantized weights: fp16 avoids the quality loss of lower-bit quantization (494M parameters, nominally 0.5B)
Blazing fast: 200-500+ tok/s on modern GPUs
8K context: ideal for function bodies, class definitions, and single-file completions
FIM tokens: native <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|> support
ChatML format: compatible with standard instruct-tuning pipelines
Apache 2.0 license: free for commercial and personal use

Architecture

Property	Value
Architecture	Dense decoder-only transformer
Total Parameters	0.5B (494M) — all active per token
Layers	24
Hidden Dim	896
Attention Heads	14 query heads, 2 KV heads (grouped-query attention)
Native Context	32,768 tokens
Configured Context	8,192 tokens (FIM-optimized)
Modalities	Text only
Chat Format	ChatML (`<|im_start|>`, `<|im_end|>`)
FIM Tokens	`<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`
Quantization	fp16 (maximum precision)
Model Size	~1 GB
License	Apache 2.0
Upstream	Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF

REQUIREMENTS

Resource	Minimum	Recommended
GPU Memory	0.5 GB VRAM	2 GB+ VRAM
System RAM	2 GB	4 GB
Disk Space	2 GB free	5 GB+ free
Ollama Version	0.30.6+	Latest

Platform support:
- Any GPU (NVIDIA, AMD, Intel Arc, Apple Silicon) with 0.5 GB+ VRAM
- CPU-only: runs acceptably on modern CPUs (0.5B is tiny)
- Integrated GPU: Intel UHD, AMD Radeon Graphics (iGPU) all work
- Raspberry Pi: possible with smaller quantized variants

💡 Why 8 GB GPU? The model is only ~1 GB. An 8 GB card leaves roughly 7 GB of headroom, so you can run this alongside your browser, IDE, and other GPU workloads without breaking a sweat.

QUICK START

1. Install Ollama

# macOS (Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

2. Pull & Run

# Pull the model (downloads ~1 GB)
ollama pull oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU

# Run interactively
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU

# Single code prompt
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU \
  "Write a Python function that merges two sorted lists"

USAGE

CLI

# Interactive chat
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU

# With a system prompt (set it at the >>> prompt inside the interactive session)
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU
>>> /set system "You are a helpful coding assistant. Provide concise code examples."

# Completion-style prompt (printf expands \n; this still goes through the chat template, not raw FIM)
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU \
  "$(printf 'Complete this function:\ndef fibonacci(n):\n    ')"

REST API

# Chat completion
curl -s http://127.0.0.1:11434/api/chat \
  -d '{
    "model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
    "messages": [{"role": "user", "content": "Write a binary search in Python"}]
  }'

# Generate (completion)
curl -s http://127.0.0.1:11434/api/generate \
  -d '{
    "model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
    "prompt": "Explain what FIM means in code completion",
    "stream": false
  }'

# FIM Completion (Fill-in-the-Middle); "raw": true sends the prompt as-is, bypassing the chat template
curl -s http://127.0.0.1:11434/api/generate \
  -d '{
    "model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
    "prompt": "<|fim_prefix|>def hello():\n    <|fim_suffix|>\n    return False<|fim_middle|>",
    "raw": true,
    "stream": false,
    "options": {
      "temperature": 0.5,
      "top_p": 0.8,
      "repeat_penalty": 1.15
    }
  }'

# OpenAI-compatible endpoint
curl -s http://127.0.0.1:11434/v1/chat/completions \
  -d '{
    "model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
    "messages": [{"role": "user", "content": "Write a React component"}],
    "stream": false
  }'
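
The first chat request in this section does not set "stream": false, so Ollama returns the reply as newline-delimited JSON chunks rather than a single object. As a minimal sketch of reassembling such a stream, the helper below joins the `message.content` pieces until a chunk reports `done`. The function name and the sample lines are illustrative, not captured output from this model.

```python
import json
from typing import Iterable


def join_chat_stream(lines: Iterable[str]) -> str:
    """Concatenate message.content from newline-delimited /api/chat chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk is flagged with done=true
    return "".join(parts)


# Hypothetical sample lines shaped like a streamed chat reply.
sample = [
    '{"message": {"role": "assistant", "content": "def "}, "done": false}',
    '{"message": {"role": "assistant", "content": "search"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(join_chat_stream(sample))
```

In practice the lines would come from iterating over the HTTP response body; passing "stream": false, as the other requests do, avoids the need for this step.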

FIM Token Reference

Token	Purpose
`<|fim_prefix|>`	Marks the beginning of the code before the hole
`<|fim_suffix|>`	Marks the code after the hole
`<|fim_middle|>`	Marks where the model should fill in
`<|repo_name|>`	Optional: repository/file context
`<|file_sep|>`	Optional: separator between files
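
The token order used in the examples in this document (prefix, then suffix, then middle) can be captured in a small helper. This is a sketch only: `build_fim_prompt` and `splice_completion` are hypothetical names, not part of Ollama or the ollama library.

```python
# Token strings taken from the FIM Token Reference table.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Return a prefix-suffix-middle prompt; the model generates the hole."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def splice_completion(prefix: str, completion: str, suffix: str) -> str:
    """Put the generated middle back between the original prefix and suffix."""
    return prefix + completion + suffix


# Rebuilds the same prompt string used in the REST and Python examples.
prompt = build_fim_prompt("def hello():\n    ", "\n    return False")
print(prompt)
```

An editor plugin would typically pass the text before the cursor as `prefix` and the text after it as `suffix`, then call `splice_completion` with whatever the model returns.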

Python (ollama library)

pip install ollama

import ollama

# Chat
response = ollama.chat(
    model='oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU',
    messages=[{'role': 'user', 'content': 'Write a JavaScript debounce function'}],
)
print(response.message.content)

# FIM Completion (raw=True sends the prompt as-is, bypassing the chat template)
response = ollama.generate(
    model='oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU',
    prompt='<|fim_prefix|>def hello():\n    <|fim_suffix|>\n    return False<|fim_middle|>',
    raw=True,
    options={
        'temperature': 0.5,
        'top_p': 0.8,
        'repeat_penalty': 1.15,
    },
)
print(response.response)

JavaScript (ollama.js)

npm install ollama

import ollama from 'ollama'

// Chat
const response = await ollama.chat({
  model: 'oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU',
  messages: [{ role: 'user', content: 'Write a CSS animation' }],
})
console.log(response.message.content)

SAMPLING PARAMETERS

These are baked into the model via its Modelfile:

Parameter	Value	FIM Rationale
`num_ctx`	8192	Enough context for function bodies and class definitions
`num_gpu`	99	Offload all 24 layers to GPU
`temperature`	0.5	Balances coherence vs. diversity in code completions
`top_p`	0.8	Focused nucleus sampling for correct code
`top_k`	40	Reasonable token variety for code generation
`min_p`	0.0	Disabled; `top_p` already controls the nucleus
`repeat_penalty`	1.15	Prevents FIM loops and repetitive code blocks
`stop`	`<|im_start|>`, `<|im_end|>`	Chat template tokens
`stop`	`<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`	FIM boundary tokens

Equivalent llama.cpp Command

./llama-cli \
  -m qwen2.5-coder-0.5b-instruct-fp16.gguf \
  -ngl 99 \
  -c 8192 \
  --temp 0.5 \
  --top-p 0.8 \
  --top-k 40 \
  --min-p 0.0 \
  --repeat-penalty 1.15 \
  -cnv

MEMORY & PERFORMANCE

VRAM Budget

Component	Size
Model weights (fp16)	~1 GB
KV cache (fp16, 8K context)	~0.1 GB
Ollama process overhead	~0.1 GB
Total	~1.2 GB ✅ plenty of headroom

Context Window Scaling

On any GPU with 2 GB+ VRAM:

Context	KV Cache	Fits 2 GB?	Notes
8,192	~0.1 GB	✅ Lots of headroom	~1.2 GB total
16,384	~0.2 GB	✅ Plenty	~1.3 GB total
32,768 (native)	~0.4 GB	✅ Native max	~1.5 GB total
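
The figures in the two tables above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes an fp16 KV cache and takes the KV head count and head size (2 heads of 64 dimensions) from the upstream Qwen2.5-0.5B configuration; treat those two values as assumptions. It ignores the ~0.1 GB process overhead listed in the VRAM budget.

```python
# Figures from the Architecture table in this document.
N_PARAMS = 494_000_000
N_LAYERS = 24
BYTES_FP16 = 2

# ASSUMPTION: KV head count and head size follow the upstream
# Qwen2.5-0.5B config; they are not verified against this GGUF.
N_KV_HEADS = 2
HEAD_DIM = 64


def weights_bytes() -> int:
    """Size of the fp16 weights: two bytes per parameter."""
    return N_PARAMS * BYTES_FP16


def kv_cache_bytes(num_ctx: int) -> int:
    """K and V tensors for every layer, one slot per context token."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16
    return per_token * num_ctx


for ctx in (8_192, 16_384, 32_768):
    total = weights_bytes() + kv_cache_bytes(ctx)
    print(f"{ctx:>6} tokens: KV {kv_cache_bytes(ctx) / 1e9:.2f} GB, "
          f"weights + KV {total / 1e9:.2f} GB")
```

Under these assumptions the KV cache costs 12,288 bytes per token, which is why even the native 32K context adds well under half a gigabyte.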

Performance

Hardware	Prompt Processing	Text Generation
RTX 4060 (8 GB)	~2,000-5,000 tok/s	~200-500 tok/s
GTX 1660 (6 GB)	~800-2,000 tok/s	~100-300 tok/s
Intel Arc A770	~1,000-3,000 tok/s	~150-400 tok/s
Apple Silicon M1	~500-1,500 tok/s	~80-200 tok/s
CPU-only (modern)	~50-200 tok/s	~20-80 tok/s

Opencode Integration

Add to ~/.config/opencode/opencode.jsonc:

"oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU": {
  "name": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
  "options": {
    "supportsThinking": false,
    "contextWindow": 8192
  }
}

Use as agent:

"model": "ollama/oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU"

Or launch directly:

ollama launch opencode --model oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU

TROUBLESHOOTING

Symptom	Fix
Model not found	Run `ollama pull oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU` first
Slow on CPU	Normal for CPU; enable any GPU with `num_gpu 99`
FIM not working	Use raw `/api/generate` (not chat) with FIM tokens
Repetitive output	Override `repeat_penalty` to 1.2 or 1.25 at runtime
Nonsensical code	Lower `temperature` to 0.3 or 0.4 at runtime
OOM error	Reduce `num_ctx` — this model is so small OOM is nearly impossible
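
The runtime overrides suggested in the table are passed per request through the `options` field, which takes precedence over the values baked into the Modelfile. As a rough sketch, the hypothetical helper below (`runtime_options` is not part of any library) checks override names against the documented parameters and returns only the values that differ from the defaults.

```python
# Defaults copied from the Sampling Parameters table in this document.
MODELFILE_DEFAULTS = {
    "num_ctx": 8192,
    "temperature": 0.5,
    "top_p": 0.8,
    "top_k": 40,
    "min_p": 0.0,
    "repeat_penalty": 1.15,
}


def runtime_options(**overrides) -> dict:
    """Return only the options that differ from the baked-in defaults."""
    unknown = set(overrides) - set(MODELFILE_DEFAULTS)
    if unknown:
        raise KeyError(f"not a documented parameter: {sorted(unknown)}")
    return {k: v for k, v in overrides.items() if MODELFILE_DEFAULTS[k] != v}


# Example: the fixes for "Repetitive output" and "Nonsensical code" combined.
opts = runtime_options(temperature=0.3, repeat_penalty=1.2)
print(opts)
```

The resulting dictionary can be passed as `options=opts` to `ollama.generate` or `ollama.chat`, or sent as the "options" object in a REST request.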

ALTERNATIVE MODELS

# Larger coding model (needs 16 GB)
ollama pull oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

# Larger MoE model (needs 24 GB)
ollama pull oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

# Smaller quantized version
ollama pull qwen2.5-coder:0.5b

CREDITS

Role	Entity
Base Model	Qwen Team, Alibaba Group
Original Model	Qwen2.5-Coder-0.5B-Instruct
GGUF Conversion	Qwen2.5-Coder-0.5B-Instruct-GGUF
Ollama Packaging	impacte.tech
License	Apache 2.0

LINKS

Resource	URL
This model on Ollama	https://ollama.com/oamazonasgabriel/qwen2.5-coder-0.5b
Upstream model (HuggingFace)	https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct
GGUF files (HuggingFace)	https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF
Qwen2.5-Coder paper	https://arxiv.org/abs/2409.12186
Ollama docs	https://docs.ollama.com
Built by	impacte.tech
