🧠 NovaForgeAI – Mistral 7B Instruct (CPU-Optimized GGUF)

Quantized Mistral-7B-Instruct models optimized for fast, local CPU inference using Ollama. Designed for privacy-first, offline, and low-memory systems with multiple performance tiers.

Built and optimized by NovaForgeAI Team

πŸš€ Overview

This repository provides multiple quantized variants of Mistral-7B-Instruct-v0.2, converted from FP16 GGUF β†’ optimized GGUF quantizations using llama.cpp.

All models are:

βœ… CPU-only (no GPU required)

βœ… Optimized for low RAM

βœ… Fast TTFT & stable inference

βœ… Ideal for local desktop apps and APIs


⚑ Performance Summary (CPU-Only)

πŸ”₯ Q2_K – Ultra-Fast Variant

Avg Speed: 7.82 tok/s

TTFT: 18.54s

Best for: speed-critical apps, single-turn queries

RAM: 4–6 GB

βš–οΈ Q3_K_M – Balanced Variant

Avg Speed: 4.56 tok/s

TTFT: 20.37s

Context: 1024 tokens

Best for: most users, multi-turn chat

🧠 Q4_K_M – Quality-Focused Variant

Strong reasoning & coding

Best quality-to-size ratio

Recommended for complex prompts


⚠️ Lower quantization levels are faster but less precise. Choose a variant based on your speed vs. quality needs.

βš™οΈ Runtime Configuration (Important)

Although Mistral supports up to a 32K context, NovaForgeAI models intentionally use a smaller context window for maximum speed on CPU.

Example (varies per variant):

```
PARAMETER num_ctx 1024      # 512–1024 depending on variant
PARAMETER num_thread 8
PARAMETER num_batch 512
PARAMETER temperature 0.7
```

The Ollama UI may display β€œ32K context” (the base model's capability), but the actual runtime context is tuned for performance.
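If a task needs a larger window, these defaults can be raised per session without rebuilding the model. A minimal sketch using Ollama's interactive `/set parameter` command (the model tag is a placeholder; expect slower CPU inference at higher `num_ctx`):

```bash
# placeholder tag; substitute the actual NovaForgeAI variant tag
ollama run novaforge-mistral:q3_k_m
>>> /set parameter num_ctx 2048     # widen the context for this session only
>>> /set parameter num_predict 512  # allow longer completions
```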

▢️ How to Use

Run Model
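A minimal example; the tag `novaforge-mistral:q3_k_m` is a placeholder, so substitute the actual tag listed on this page:

```bash
# pulls the model on first use, then opens an interactive chat
ollama run novaforge-mistral:q3_k_m
```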

Single Prompt
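For a one-shot query, pass the prompt directly (same placeholder tag):

```bash
# runs a single prompt and exits instead of opening a chat session
ollama run novaforge-mistral:q3_k_m "Explain GGUF quantization in two sentences."
```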

πŸ”Œ API Usage

Start server:

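If Ollama is not already running as a background service:

```bash
# starts the Ollama server; it listens on http://localhost:11434 by default
ollama serve
```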

Endpoint:

`POST http://localhost:11434/api/generate` (use `/api/chat` for multi-turn conversations)

Example request:
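A minimal sketch against Ollama's generate API; the model tag is again a placeholder:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "novaforge-mistral:q3_k_m",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

Set `"stream": true` (the default) to receive tokens incrementally as newline-delimited JSON.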

🧩 Use Cases

πŸ’¬ Local chat assistants

🧠 Reasoning & Q/A

πŸ§‘β€πŸ’» Code explanation & debugging

πŸ“ Summarization

πŸ“š Research assistants

πŸ–₯️ Electron / Python / React AI apps

πŸ› οΈ Quantization Pipeline

Base: Mistral-7B-Instruct-v0.2 (FP16 GGUF)

Tool: llama.cpp (llama-quantize; see the sketch after this list)

Formats: Q2_K / Q3_K_M / Q4_K_M

❌ No Ollama internal re-quantization

βœ… Fully reproducible builds
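For reproduction, a sketch of the pipeline using llama.cpp's `llama-quantize` and `ollama create`; all filenames, tags, and the Modelfile path are placeholders:

```bash
# quantize the FP16 GGUF into each variant (output type is the last argument)
./llama-quantize mistral-7b-instruct-v0.2-f16.gguf mistral-7b-instruct-v0.2-Q2_K.gguf   Q2_K
./llama-quantize mistral-7b-instruct-v0.2-f16.gguf mistral-7b-instruct-v0.2-Q3_K_M.gguf Q3_K_M
./llama-quantize mistral-7b-instruct-v0.2-f16.gguf mistral-7b-instruct-v0.2-Q4_K_M.gguf Q4_K_M

# package a pre-quantized GGUF into Ollama without re-quantizing
# (Modelfile contains: FROM ./mistral-7b-instruct-v0.2-Q3_K_M.gguf)
ollama create novaforge-mistral:q3_k_m -f Modelfile
```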

⚠️ Limitations

Text-only (no vision)

Reduced precision on Q2_K for complex reasoning

Not ideal for high-precision math tasks

πŸ“œ License & Credits

Base Model: Mistral AI

Quantization & Packaging: NovaForgeAI

Inference Engine: llama.cpp

Platform: Ollama

πŸš€ Model Recommendations

πŸ”₯ Fastest: Q2_K (low-RAM, instant replies)

⭐ Best Overall: Q3_K_M (recommended)

🧠 Best Quality: Q4_K_M (reasoning & coding)

πŸ›£οΈ Roadmap (Updated)

βœ… Q2 / Q3 / Q4 variants

πŸ”œ Vision models

πŸ”œ RAG-optimized builds

πŸ”œ Long-context research models

πŸ”œ NovaForgeAI Desktop integration

NovaForgeAI

Privacy-first β€’ CPU-optimized β€’ Open Source