
Qwen 2.5 3B – NovaForgeAI Edition

Qwen 2.5 3B – NovaForgeAI Edition is a CPU-optimized, low-latency LLM designed for fast local inference on low-end and mid-range systems.

ollama run novaforgeai/qwen2.5-3b:q3km

Details

0b1abd52d7a2 · 1.6GB · qwen2 · 3.09B · Q3_K_M

System prompt: You are Qwen 2.5, a helpful AI assistant created by the Qwen team at Alibaba Cloud and optimized for …

Parameters: { "num_batch": 512, "num_ctx": 1024, "num_gpu": 0, "num_thread": 8, "repeat_pena…

Readme

Qwen 2.5 3B – NovaForgeAI Edition

CPU-optimized GGUF quantized variants of Qwen 2.5 3B Instruct
Optimized for the NovaForgeAI Desktop App · Maintained by the NovaForgeAI Team

🚀 Quick Start

Best quality (recommended)

ollama run novaforgeai/qwen2.5-3b:q4km

Balanced memory & speed

ollama run novaforgeai/qwen2.5-3b:q3km

Smallest & fastest (testing only)

ollama run novaforgeai/qwen2.5-3b:q2k

All variants are CPU-only and work fully offline once downloaded.
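Beyond the CLI, a downloaded variant can also be queried from code through Ollama's local HTTP API (default port 11434). A minimal sketch, assuming the official `ollama` Python package is installed (`pip install ollama`) and the Ollama server is running:

```python
import ollama

# Ask the q4km variant a question. The options dict can override the
# Modelfile defaults shown in Details (e.g. thread count) per request.
response = ollama.chat(
    model="novaforgeai/qwen2.5-3b:q4km",
    messages=[{"role": "user", "content": "Explain GGUF quantization in one paragraph."}],
    options={"num_thread": 8},
)
print(response["message"]["content"])
```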

📊 Variant Comparison

| Variant | Size | RAM | Context (tokens) | Speed | Quality | Recommended Use |
|---------|---------|---------|------------------|---------|---------|--------------------|
| Q4_K_M | ~1.8 GB | ~3 GB | 2048 | Medium | ⭐⭐⭐⭐⭐ | Production & demos |
| Q3_K_M | ~1.5 GB | ~2.5 GB | 1024 | Fast | ⭐⭐⭐ | Low-RAM systems |
| Q2_K | ~1.2 GB | ~2 GB | 768 | Fastest | ⭐ | Testing only |

🏆 Recommended: q4km — best balance of accuracy, stability, and speed.

💡 Choosing the Right Variant

✅ Use Q4_K_M if:

You want accurate & stable answers

You have 4GB+ RAM

You're using it in presentations or production

⚠️ Use Q3_K_M if:

RAM is limited (3–4 GB)

You prefer speed over depth

❌ Avoid Q2_K for:

Serious reasoning tasks

Long answers or coding (quality loss is significant)
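If you want to pick a tag programmatically, the RAM figures from the comparison table translate into a simple heuristic. A sketch, assuming `psutil` is available (`pip install psutil`); the thresholds are this README's guidance, not anything enforced by Ollama:

```python
import psutil

def suggest_variant() -> str:
    """Map currently available RAM to a recommended model tag."""
    free_gb = psutil.virtual_memory().available / 1024**3
    if free_gb >= 3.0:
        return "novaforgeai/qwen2.5-3b:q4km"  # ~3 GB needed, best quality
    if free_gb >= 2.5:
        return "novaforgeai/qwen2.5-3b:q3km"  # ~2.5 GB needed, balanced
    return "novaforgeai/qwen2.5-3b:q2k"       # ~2 GB needed, testing only

print("Suggested tag:", suggest_variant())
```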

🔧 Technical Overview

Base Model

Name: Qwen/Qwen2.5-3B-Instruct

Parameters: 3 Billion

Architecture: Transformer (Qwen2)

Original Context: 32K tokens

License: Qwen Research License

Why Quantization?

The original FP16 model is ~6–7 GB and slow on CPUs. Quantization converts it into compact GGUF files that:

Reduce RAM usage

Increase inference speed

Enable CPU-only execution
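The quoted sizes check out with simple arithmetic. A rough sketch (weights only, ignoring file metadata; the effective bits-per-weight figures for the K-quants are approximate averages, not exact values):

```python
params = 3.09e9                   # parameter count from the Details section

fp16_gb = params * 16  / 8 / 1e9  # FP16: 16 bits per weight
q4km_gb = params * 4.7 / 8 / 1e9  # Q4_K_M: roughly 4.7 bits/weight on average
q3km_gb = params * 3.9 / 8 / 1e9  # Q3_K_M: roughly 3.9 bits/weight on average

print(f"FP16 ≈ {fp16_gb:.1f} GB, Q4_K_M ≈ {q4km_gb:.1f} GB, Q3_K_M ≈ {q3km_gb:.1f} GB")
# FP16 ≈ 6.2 GB, Q4_K_M ≈ 1.8 GB, Q3_K_M ≈ 1.5 GB, in line with the table above
```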

🧠 Quantization Explained (Simple)

| Format | What it means | Result |
|--------|---------------|--------|
| FP16 | Full precision | High quality, very slow |
| GGUF | llama.cpp optimized format | CPU-friendly |
| Q4_K_M | Smart mixed 4–6 bit | Best balance |
| Q3_K_M | More compression | Faster, less accurate |
| Q2_K | Aggressive compression | Fast but unstable |

Quantization does NOT retrain the model — it only compresses weights.

🎯 Use Cases

Perfect for:

Local AI assistants

Coding help & explanations

Summarization & translation

Offline & privacy-focused apps

Student & FYP projects

Optimized for:

Low-end CPUs

No GPU

Desktop environments

🧪 Benchmark Summary (CPU)

| Variant | Avg Speed | Stability | Verdict |
|---------|------------|-----------|---------------|
| Q4_K_M | ~4.5 tok/s | Excellent | ✅ Best |
| Q3_K_M | ~6 tok/s | Moderate | ⚠️ Acceptable |
| Q2_K | ~7 tok/s | Poor | ❌ Not usable |
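To reproduce a rough tokens-per-second figure on your own hardware, `ollama run` accepts a `--verbose` flag that prints timing stats (including eval rate) after each response; your numbers will vary with CPU and thread count:

ollama run novaforgeai/qwen2.5-3b:q4km --verbose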
📦 Local File Mapping

E:\NovaForgeAI\models\quantized
├── qwen2.5-3b-q4km.gguf
├── qwen2.5-3b-q3km.gguf
└── qwen2.5-3b-q2k.gguf

These files are referenced directly by Ollama Modelfiles.
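As an illustration, a minimal Modelfile for the q3km tag might look like the following. This is a sketch based on the parameters shown in the Details section above, not necessarily the exact Modelfile behind this tag:

```
FROM E:\NovaForgeAI\models\quantized\qwen2.5-3b-q3km.gguf

# Runtime defaults mirroring the Details section above
PARAMETER num_batch 512
PARAMETER num_ctx 1024
PARAMETER num_gpu 0
PARAMETER num_thread 8
```

A tag built from it would be registered with `ollama create novaforgeai/qwen2.5-3b:q3km -f Modelfile`.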

🌟 NovaForgeAI Edition Benefits

Clean Ollama tag-based structure

CPU-first tuning

No redundant base models

Professional documentation

Ready for demo, FYP & production

📄 License & Credits

Base Model: Qwen Team (Alibaba Cloud)

Quantization: NovaForgeAI (llama.cpp)

License: Qwen Research License

Status: ✅ Production Ready

Optimized for: NovaForgeAI Desktop App

Maintained by: NovaForgeAI Team