
A lightweight, FIM (Fill-In-the-Middle) optimized variant of Qwen2.5-Coder-0.5B-Instruct using the fp16 GGUF quantization from HuggingFace. At only ~1 GB, it fits comfortably on any 8 GB single GPU with headroom for 8K context.

ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU

Qwen2.5-Coder-0.5B — Ollama Model (8 GB) FIM-Optimized

Model: oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU

Ollama Built by impacte.tech


DESCRIPTION

A lightweight, FIM (Fill-In-the-Middle) optimized variant of Qwen2.5-Coder-0.5B-Instruct, built from the fp16 GGUF published on HuggingFace. At only ~1 GB, it fits comfortably on any single 8 GB GPU with headroom for 8K context, which makes it well suited to real-time code completions, inline suggestions, and lightweight agentic coding tasks.

The ideal lightweight coding companion for: RTX 4060 · RTX 5060 · RTX 3060 · GTX 1660 · Intel Arc · Apple Silicon M-series · any GPU with 2 GB+ VRAM

Key Features

  • FIM-optimized: temperature 0.5, top_p 0.8, repeat_penalty 1.15 — tuned for precise fill-in-the-middle code completions
  • Tiny footprint: only ~1 GB — runs on virtually any GPU, including integrated GPUs
  • Unquantized: fp16 weights avoid quantization loss (0.5B params, 494M)
  • Blazing fast: 200-500+ tok/s on modern GPUs
  • 8K context: ideal for function bodies, class definitions, and single-file completions
  • FIM tokens: native <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|> support
  • ChatML format: compatible with standard instruct-tuning pipelines
  • Apache 2.0 license: free for commercial and personal use

Architecture

| Property | Value |
| --- | --- |
| Architecture | Dense decoder-only transformer |
| Total Parameters | 0.5B (494M), all active per token |
| Layers | 24 |
| Hidden Dim | 896 |
| Attention Heads | 14 query heads, 2 key/value heads (GQA) |
| Native Context | 32,768 tokens |
| Configured Context | 8,192 tokens (FIM-optimized) |
| Modalities | Text only |
| Chat Format | ChatML (`<\|im_start\|>`, `<\|im_end\|>`) |
| FIM Tokens | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>` |
| Quantization | fp16 (unquantized) |
| Model Size | ~1 GB |
| License | Apache 2.0 |
| Upstream | Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF |

REQUIREMENTS

| Resource | Minimum | Recommended |
| --- | --- | --- |
| GPU Memory | 0.5 GB VRAM | 2 GB+ VRAM |
| System RAM | 2 GB | 4 GB |
| Disk Space | 2 GB free | 5 GB+ free |
| Ollama Version | 0.30.6+ | Latest |

Platform support:

  • Any GPU (NVIDIA, AMD, Intel Arc, Apple Silicon) with 0.5 GB+ VRAM
  • CPU-only: runs acceptably on modern CPUs (0.5B is tiny)
  • Integrated GPU: Intel UHD and AMD Radeon Graphics iGPUs all work
  • Raspberry Pi: possible with smaller quantized variants

💡 Why an 8 GB GPU? The model is only ~1 GB, so an 8 GB card has about 7 GB of headroom. You can run this alongside your browser, IDE, and other GPU workloads without breaking a sweat.


QUICK START

1. Install Ollama

# macOS (Homebrew)
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

2. Pull & Run

# Pull the model (downloads ~1 GB)
ollama pull oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU

# Run interactively
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU

# Single code prompt
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU \
  "Write a Python function that merges two sorted lists"

USAGE

CLI

# Interactive chat
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU

# With system prompt (set inside the interactive session; `ollama run` has no --system flag)
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU
>>> /set system "You are a helpful coding assistant. Provide concise code examples."

# Completion-style prompt (inline); bash/zsh $'...' quoting turns \n into real newlines
ollama run oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU \
  $'Complete this function:\ndef fibonacci(n):\n    '

REST API

# Chat completion
curl -s http://127.0.0.1:11434/api/chat \
  -d '{
    "model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
    "messages": [{"role": "user", "content": "Write a binary search in Python"}]
  }'

# Generate (completion)
curl -s http://127.0.0.1:11434/api/generate \
  -d '{
    "model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
    "prompt": "Explain what FIM means in code completion",
    "stream": false
  }'

# FIM Completion (Fill-in-the-Middle)
curl -s http://127.0.0.1:11434/api/generate \
  -d '{
    "model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
    "prompt": "<|fim_prefix|>def hello():\n    <|fim_suffix|>\n    return False<|fim_middle|>",
    "raw": true,
    "stream": false,
    "options": {
      "temperature": 0.5,
      "top_p": 0.8,
      "repeat_penalty": 1.15
    }
  }'

# OpenAI-compatible endpoint
curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
    "messages": [{"role": "user", "content": "Write a React component"}],
    "stream": false
  }'
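
The FIM curl call above can also be built from Python with only the standard library. The following is a minimal sketch, assuming a local Ollama server on the default port; `build_fim_request` and `complete` are hypothetical helper names, and `"raw": true` is set so that the chat template is skipped and the FIM tokens reach the model unchanged.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # same endpoint as the curl examples
MODEL = "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU"


def build_fim_request(prefix: str, suffix: str) -> urllib.request.Request:
    """Build (but do not send) a raw-mode FIM request for /api/generate."""
    payload = {
        "model": MODEL,
        "prompt": f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>",
        "raw": True,      # no chat template; the prompt is passed through as written
        "stream": False,  # one JSON object instead of a stream of chunks
        "options": {"temperature": 0.5, "top_p": 0.8, "repeat_penalty": 1.15},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prefix: str, suffix: str) -> str:
    """Send the request to a running Ollama server and return the generated middle."""
    with urllib.request.urlopen(build_fim_request(prefix, suffix)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `complete("def hello():\n    ", "\n    return False")` should return the text the model proposes for the hole.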

FIM Token Reference

| Token | Purpose |
| --- | --- |
| `<\|fim_prefix\|>` | Marks the beginning of the code before the hole |
| `<\|fim_suffix\|>` | Marks the code after the hole |
| `<\|fim_middle\|>` | Marks where the model should fill in |
| `<\|repo_name\|>` | Optional: repository/file context |
| `<\|file_sep\|>` | Optional: separator between files |
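
The token order in the table can be sketched as a small helper. This is an illustration only, assuming the prefix, suffix, middle layout used by the FIM examples in this README; `build_fim_prompt` is a hypothetical name, not part of any library.

```python
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Code before the hole, then code after the hole, then the marker the model fills from."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt("def hello():\n    ", "\n    return False")
print(prompt)
```

The model's output is the missing middle; the editor or client is responsible for splicing it back between the prefix and the suffix.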

Python (ollama library)

pip install ollama

import ollama

# Chat
response = ollama.chat(
    model='oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU',
    messages=[{'role': 'user', 'content': 'Write a JavaScript debounce function'}],
)
print(response.message.content)

# FIM Completion
response = ollama.generate(
    model='oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU',
    prompt='<|fim_prefix|>def hello():\n    <|fim_suffix|>\n    return False<|fim_middle|>',
    raw=True,  # bypass the chat template so the FIM tokens are sent as-is
    options={
        'temperature': 0.5,
        'top_p': 0.8,
        'repeat_penalty': 1.15,
    },
)
print(response.response)
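
In an editor integration, the prefix and suffix usually come from the text around the cursor. Here is a minimal sketch of that split, assuming a character budget as a rough stand-in for the 8K-token window; `split_at_cursor`, the default budget, and the 3:1 prefix-to-suffix ratio are illustrative assumptions rather than anything defined by the model or by Ollama.

```python
def split_at_cursor(source: str, cursor: int, max_chars: int = 16_000) -> tuple[str, str]:
    """Split source at a cursor offset and trim both sides to a character budget.

    Keeps the text nearest the cursor: the tail of the prefix and the head of the suffix.
    """
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor is outside the source text")
    prefix, suffix = source[:cursor], source[cursor:]
    prefix_budget = (max_chars * 3) // 4  # assumption: the prefix gets most of the budget
    suffix_budget = max_chars - prefix_budget
    trimmed_prefix = prefix[-prefix_budget:] if prefix_budget else ""
    return trimmed_prefix, suffix[:suffix_budget]


code = "def add(a, b):\n    \n    return result\n"
prefix, suffix = split_at_cursor(code, cursor=19)  # cursor sits at the end of the empty body line
print(repr(prefix), repr(suffix))
```

The two returned strings can be placed between the FIM tokens to form the prompt passed to `ollama.generate`.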

JavaScript (ollama.js)

npm install ollama

import ollama from 'ollama'

// Chat
const response = await ollama.chat({
  model: 'oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU',
  messages: [{ role: 'user', content: 'Write a CSS animation' }],
})
console.log(response.message.content)

SAMPLING PARAMETERS

These are baked into the model via its Modelfile:

| Parameter | Value | FIM Rationale |
| --- | --- | --- |
| num_ctx | 8192 | Enough context for function bodies and class definitions |
| num_gpu | 99 | Offload all 24 layers to GPU |
| temperature | 0.5 | Balances coherence vs. diversity in code completions |
| top_p | 0.8 | Focused nucleus sampling for correct code |
| top_k | 40 | Reasonable token variety for code generation |
| min_p | 0.0 | Disabled; top_p already controls the nucleus |
| repeat_penalty | 1.15 | Prevents FIM loops and repetitive code blocks |
| stop | `<\|im_start\|>`, `<\|im_end\|>` | Chat template tokens |
| stop | `<\|fim_prefix\|>`, `<\|fim_middle\|>`, `<\|fim_suffix\|>` | FIM boundary tokens |
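
The table maps directly onto Modelfile `PARAMETER` lines. The sketch below renders them from a dictionary; it is a reconstruction from the table, not the author's actual Modelfile, and both `render_modelfile` and the `FROM` path are hypothetical.

```python
PARAMS = {
    "num_ctx": 8192,
    "num_gpu": 99,
    "temperature": 0.5,
    "top_p": 0.8,
    "top_k": 40,
    "min_p": 0.0,
    "repeat_penalty": 1.15,
}
STOP_TOKENS = ["<|im_start|>", "<|im_end|>", "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>"]


def render_modelfile(gguf_path: str) -> str:
    """Render a Modelfile: one FROM line, one PARAMETER line per setting and per stop token."""
    lines = [f"FROM {gguf_path}"]
    lines += [f"PARAMETER {name} {value}" for name, value in PARAMS.items()]
    lines += [f'PARAMETER stop "{token}"' for token in STOP_TOKENS]
    return "\n".join(lines) + "\n"


# Hypothetical local path to the downloaded fp16 GGUF file.
print(render_modelfile("./qwen2.5-coder-0.5b-instruct-fp16.gguf"))
```

Saved as `Modelfile`, output like this can be passed to `ollama create <name> -f Modelfile` to build a local variant with different defaults.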

Equivalent llama.cpp Command

./llama-cli \
  -m qwen2.5-coder-0.5b-instruct-fp16.gguf \
  -ngl 99 \
  -c 8192 \
  --temp 0.5 \
  --top-p 0.8 \
  --top-k 40 \
  --min-p 0.0 \
  --repeat-penalty 1.15 \
  -cnv

MEMORY & PERFORMANCE

VRAM Budget

| Component | Size |
| --- | --- |
| Model weights (fp16) | ~1 GB |
| KV cache (fp16, 8K context) | ~0.1 GB |
| Ollama process overhead | ~0.1 GB |
| Total | ~1.2 GB ✅ plenty of headroom |

Context Window Scaling

On any GPU with 2 GB+ VRAM:

| Context | KV Cache | Fits 2 GB? | Notes |
| --- | --- | --- | --- |
| 8,192 | ~0.1 GB | ✅ Lots of headroom | ~1.2 GB total |
| 16,384 | ~0.2 GB | ✅ Plenty | ~1.3 GB total |
| 32,768 (native) | ~0.4 GB | ✅ Native max | ~1.5 GB total |
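
The figures in this table can be approximated with simple arithmetic. The sketch below assumes values from the upstream Qwen2.5-0.5B configuration that this README does not state (2 key/value heads under grouped-query attention, head dimension 64) and reuses the ~0.1 GB process overhead from the VRAM budget. Note that the cache size depends on the number of key/value heads, not the number of query heads.

```python
# Assumptions taken from the upstream model configuration, not from this README:
# 24 layers, 2 key/value heads (grouped-query attention), head dimension 64.
LAYERS = 24
KV_HEADS = 2
HEAD_DIM = 64
BYTES_FP16 = 2
WEIGHT_BYTES = 494_000_000 * BYTES_FP16  # 494M parameters stored as fp16
OVERHEAD_GB = 0.1                        # process overhead from the VRAM budget table


def kv_cache_bytes(context_tokens: int) -> int:
    """One key and one value vector per layer, per KV head, per position."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16 * context_tokens


for ctx in (8_192, 16_384, 32_768):
    kv_gb = kv_cache_bytes(ctx) / 1e9
    total_gb = (WEIGHT_BYTES + kv_cache_bytes(ctx)) / 1e9 + OVERHEAD_GB
    print(f"{ctx:>6} tokens: KV cache {kv_gb:.2f} GB, total {total_gb:.2f} GB")
```

Under these assumptions each token costs 12,288 bytes of cache, which works out to roughly 0.10, 0.20, and 0.40 GB for the three context sizes, in line with the table.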

Performance

| Hardware | Prompt Processing | Text Generation |
| --- | --- | --- |
| RTX 4060 (8 GB) | ~2,000-5,000 tok/s | ~200-500 tok/s |
| GTX 1660 (6 GB) | ~800-2,000 tok/s | ~100-300 tok/s |
| Intel Arc A770 | ~1,000-3,000 tok/s | ~150-400 tok/s |
| Apple Silicon M1 | ~500-1,500 tok/s | ~80-200 tok/s |
| CPU-only (modern) | ~50-200 tok/s | ~20-80 tok/s |
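
To turn these throughput ranges into a rough completion latency, add the time to process the prompt to the time to generate the output. This is a back-of-the-envelope sketch; `completion_latency_ms` is a hypothetical helper, and the token counts are made-up example values, not measurements.

```python
def completion_latency_ms(prompt_tokens: int, output_tokens: int,
                          prompt_tok_s: float, gen_tok_s: float) -> float:
    """Prompt processing time plus generation time, in milliseconds."""
    return 1000 * (prompt_tokens / prompt_tok_s + output_tokens / gen_tok_s)


# Low end of the RTX 4060 row: 2,000 tok/s prompt processing, 200 tok/s generation.
# Example request: a 1,000-token prompt and a 30-token completion.
latency = completion_latency_ms(1_000, 30, prompt_tok_s=2_000, gen_tok_s=200)
print(f"{latency:.0f} ms")  # 0.5 s for the prompt + 0.15 s for the output
```

The estimate ignores model load time and network overhead, so real first-request latency will be higher.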

Opencode Integration

Add to ~/.config/opencode/opencode.jsonc:

"oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU": {
  "name": "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
  "options": {
    "supportsThinking": false,
    "contextWindow": 8192
  }
}

Use as agent:

"model": "ollama/oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU"

Or launch directly:

ollama launch opencode --model oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU

TROUBLESHOOTING

| Symptom | Fix |
| --- | --- |
| Model not found | Run `ollama pull oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU` first |
| Slow on CPU | Normal for CPU; enable any GPU with num_gpu 99 |
| FIM not working | Use raw /api/generate (not chat) with FIM tokens |
| Repetitive output | Override repeat_penalty to 1.2 or 1.25 at runtime |
| Nonsensical code | Lower temperature to 0.3 or 0.4 at runtime |
| OOM error | Reduce num_ctx; this model is so small that OOM is nearly impossible |
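
The runtime overrides in the table are passed per request through the `options` field. The sketch below builds such a request body for `/api/chat`; `chat_payload` is a hypothetical helper, and options left out of the request keep the values baked into the Modelfile.

```python
def chat_payload(model: str, user_message: str, **overrides: float) -> dict:
    """Build an /api/chat body; only the overridden sampling options are sent."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
        "options": dict(overrides),
    }


# Example: tighten sampling for a request that produced repetitive or nonsensical code.
body = chat_payload(
    "oamazonasgabriel/qwen2.5-coder-0.5b:fp16-8gbGPU",
    "Write a binary search in Python",
    temperature=0.3,
    repeat_penalty=1.2,
)
print(body["options"])
```

The same `options` dictionary can be passed to `ollama.chat(...)` or `ollama.generate(...)` in the Python library.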

ALTERNATIVE MODELS

# Larger coding model (needs 16 GB)
ollama pull oamazonasgabriel/qwen3.5-9b:q4-16gbGPU

# Larger MoE model (needs 24 GB)
ollama pull oamazonasgabriel/qwen3.6-35b-a3b:q4-24gbGPU

# Smaller quantized version
ollama pull qwen2.5-coder:0.5b

CREDITS

| Role | Entity |
| --- | --- |
| Base Model | Qwen Team, Alibaba Group |
| Original Model | Qwen2.5-Coder-0.5B-Instruct |
| GGUF Conversion | Qwen2.5-Coder-0.5B-Instruct-GGUF |
| Ollama Packaging | impacte.tech |
| License | Apache 2.0 |

LINKS

| Resource | URL |
| --- | --- |
| This model on Ollama | https://ollama.com/oamazonasgabriel/qwen2.5-coder-0.5b |
| Upstream model (HuggingFace) | https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct |
| GGUF files (HuggingFace) | https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct-GGUF |
| Qwen2.5-Coder paper | https://arxiv.org/abs/2409.12186 |
| Ollama docs | https://docs.ollama.com |
| Built by | impacte.tech |
