31 Downloads · Updated 12 hours ago
930ceb14a6c5 · 4.4GB
Quantized Mistral-7B-Instruct models optimized for fast, local CPU inference using Ollama. Designed for privacy-first, offline, and low-memory systems with multiple performance tiers.
Built and optimized by NovaForgeAI Team
This repository provides multiple quantized variants of Mistral-7B-Instruct-v0.2, converted from the FP16 GGUF base into optimized GGUF quantizations using llama.cpp.
All models are:
✅ CPU-only (no GPU required)
✅ Optimized for low RAM
✅ Fast TTFT & stable inference
✅ Ideal for local desktop apps and APIs
🔥 Q2_K – Ultra-Fast Variant
Avg Speed: 7.82 tok/s
TTFT: 18.54s
Best for: speed-critical apps, single-turn queries
RAM: 4β6 GB
⭐ Q3_K_M – Balanced Variant (Recommended)
Avg Speed: 4.56 tok/s
TTFT: 20.37s
Context: 1024 tokens
Best for: most users, multi-turn chat
🧠 Q4_K_M – Best Quality Variant
Strong reasoning & coding
Best quality-to-size ratio
Recommended for complex prompts
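As a rough sanity check on the numbers above, total response time is approximately TTFT plus token count divided by throughput. For a 256-token reply:

```shell
# Approximate wall-clock time for a 256-token reply: TTFT + tokens / throughput.
# Numbers are taken from the variant table above.
awk 'BEGIN { printf "Q2_K:   %.1f s\n", 18.54 + 256/7.82 }'
awk 'BEGIN { printf "Q3_K_M: %.1f s\n", 20.37 + 256/4.56 }'
```

For short replies TTFT dominates, so the practical gap between variants narrows; the throughput advantage of Q2_K matters most on long generations.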
⚠️ Lower quantization = faster inference but reduced precision. Choose a variant based on your speed vs. quality needs.
Although Mistral supports up to a 32K context window, NovaForgeAI models intentionally use a smaller context for maximum speed on CPU.
Example (varies per variant):
```
PARAMETER num_ctx 512-1024
PARAMETER num_thread 8
PARAMETER num_batch 512
PARAMETER temperature 0.7
```
Ollama UI may display "32K context" (the base model capability), but the actual runtime context is optimized for performance.
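Parameters like those above live in an Ollama Modelfile; as a sketch (the model name here is a placeholder, not the published tag), a variant can be built locally with `ollama create`:

```shell
# Build a local model from a Modelfile containing the PARAMETER lines above.
# "novaforge-mistral-q4" is a placeholder name; choose any tag you like.
ollama create novaforge-mistral-q4 -f ./Modelfile
```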
Run Model
Single Prompt
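A one-shot prompt looks like this; the model tag below is a placeholder for whichever variant you pulled:

```shell
# Single-prompt inference from the terminal (no server setup needed).
ollama run novaforgeai/mistral-7b-instruct:q4_k_m "Explain GGUF quantization in two sentences."
```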
Start server:
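Ollama's server listens on localhost:11434 by default:

```shell
# Start the Ollama HTTP API server (default port 11434).
ollama serve
```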
Endpoint: `POST http://localhost:11434/api/generate`
Example request:
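A minimal non-streaming request against Ollama's generate endpoint; the model tag is a placeholder for the variant you pulled:

```shell
# POST a prompt to the local Ollama API; "stream": false returns one JSON object.
curl http://localhost:11434/api/generate -d '{
  "model": "novaforgeai/mistral-7b-instruct:q4_k_m",
  "prompt": "Summarize the trade-offs between Q2_K and Q4_K_M.",
  "stream": false
}'
```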
💬 Local chat assistants
🧠 Reasoning & Q/A
🧑‍💻 Code explanation & debugging
📝 Summarization
📚 Research assistants
🖥️ Electron / Python / React AI apps
Base: Mistral-7B-Instruct-v0.2 (FP16 GGUF)
Tool: llama.cpp (llama-quantize)
Formats: Q2_K / Q3_K_M / Q4_K_M
✅ No Ollama internal re-quantization
✅ Fully reproducible builds
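The quantization step described above can be sketched with llama.cpp's `llama-quantize`, which takes an input GGUF, an output path, and a quantization type (the file names here are hypothetical):

```shell
# Produce the three published variants from the FP16 GGUF base.
./llama-quantize mistral-7b-instruct-v0.2.f16.gguf mistral-7b-instruct-v0.2.Q2_K.gguf   Q2_K
./llama-quantize mistral-7b-instruct-v0.2.f16.gguf mistral-7b-instruct-v0.2.Q3_K_M.gguf Q3_K_M
./llama-quantize mistral-7b-instruct-v0.2.f16.gguf mistral-7b-instruct-v0.2.Q4_K_M.gguf Q4_K_M
```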
Text-only (no vision)
Reduced precision on Q2_K for complex reasoning
Not ideal for high-precision math tasks
Base Model: Mistral AI
Quantization & Packaging: NovaForgeAI
Inference Engine: llama.cpp
Platform: Ollama
🔥 Fastest: Q2_K (low-RAM, instant replies)
⭐ Best Overall: Q3_K_M (recommended)
🧠 Best Quality: Q4_K_M (reasoning & coding)
✅ Q2 / Q3 / Q4 variants
🔜 Vision models
🔜 RAG-optimized builds
🔜 Long-context research models
🔜 NovaForgeAI Desktop integration
Privacy-first • CPU-optimized • Open Source