31 Downloads · Updated 12 hours ago
930ceb14a6c5 · 4.4GB
Quantized Mistral-7B-Instruct models optimized for fast, local CPU inference using Ollama. Designed for privacy-first, offline, and low-memory systems with multiple performance tiers.
Built and optimized by NovaForgeAI Team
This repository provides multiple quantized variants of Mistral-7B-Instruct-v0.2, converted from the FP16 GGUF base into optimized GGUF quantizations using llama.cpp.
All models are:
✅ CPU-only (no GPU required)
✅ Optimized for low RAM
✅ Fast TTFT & stable inference
✅ Ideal for local desktop apps and APIs
🔥 Q2_K – Ultra-Fast Variant
Avg Speed: 7.82 tok/s
TTFT: 18.54s
Best for: speed-critical apps, single-turn queries
RAM: 4β6 GB
⭐ Q3_K_M – Balanced Variant (Recommended)
Avg Speed: 4.56 tok/s
TTFT: 20.37s
Context: 1024 tokens
Best for: most users, multi-turn chat
🧠 Q4_K_M – Best Quality Variant
Strong reasoning & coding
Best quality-to-size ratio
Recommended for complex prompts
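As a rough sanity check on the numbers above, total response time is approximately TTFT plus token count divided by throughput. For a 256-token reply:

```shell
# Approximate wall-clock time for a 256-token reply: TTFT + tokens / throughput.
# Numbers are taken from the variant table above.
awk 'BEGIN { printf "Q2_K:   %.1f s\n", 18.54 + 256/7.82 }'
awk 'BEGIN { printf "Q3_K_M: %.1f s\n", 20.37 + 256/4.56 }'
```

For short replies TTFT dominates, so the practical gap between variants narrows; the throughput advantage of Q2_K matters most on long generations.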
⚠️ Lower quantization = faster inference but reduced precision. Choose a variant based on your speed vs. quality needs.
Although Mistral supports up to a 32K context window, NovaForgeAI models intentionally use a smaller context for maximum speed on CPU.
Example (varies per variant):
```
PARAMETER num_ctx 512-1024
PARAMETER num_thread 8
PARAMETER num_batch 512
PARAMETER temperature 0.7
```
Ollama UI may display "32K context" (the base model capability), but the actual runtime context is optimized for performance.
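Parameters like those above live in an Ollama Modelfile; as a sketch (the model name here is a placeholder, not the published tag), a variant can be built locally with `ollama create`:

```shell
# Build a local model from a Modelfile containing the PARAMETER lines above.
# "novaforge-mistral-q4" is a placeholder name; choose any tag you like.
ollama create novaforge-mistral-q4 -f ./Modelfile
```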
Run Model
Single Prompt
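A one-shot prompt looks like this; the model tag below is a placeholder for whichever variant you pulled:

```shell
# Single-prompt inference from the terminal (no server setup needed).
ollama run novaforgeai/mistral-7b-instruct:q4_k_m "Explain GGUF quantization in two sentences."
```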
Start server:
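Ollama's server listens on localhost:11434 by default:

```shell
# Start the Ollama HTTP API server (default port 11434).
ollama serve
```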
Endpoint: `POST http://localhost:11434/api/generate`
Example request:
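A minimal non-streaming request against Ollama's generate endpoint; the model tag is a placeholder for the variant you pulled:

```shell
# POST a prompt to the local Ollama API; "stream": false returns one JSON object.
curl http://localhost:11434/api/generate -d '{
  "model": "novaforgeai/mistral-7b-instruct:q4_k_m",
  "prompt": "Summarize the trade-offs between Q2_K and Q4_K_M.",
  "stream": false
}'
```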
💬 Local chat assistants
🧠 Reasoning & Q/A
🧑‍💻 Code explanation & debugging
📝 Summarization
📚 Research assistants
🖥️ Electron / Python / React AI apps
Base: Mistral-7B-Instruct-v0.2 (FP16 GGUF)
Tool: llama.cpp (llama-quantize)
Formats: Q2_K / Q3_K_M / Q4_K_M
✅ No Ollama internal re-quantization
✅ Fully reproducible builds
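The quantization step described above can be sketched with llama.cpp's `llama-quantize`, which takes an input GGUF, an output path, and a quantization type (the file names here are hypothetical):

```shell
# Produce the three published variants from the FP16 GGUF base.
./llama-quantize mistral-7b-instruct-v0.2.f16.gguf mistral-7b-instruct-v0.2.Q2_K.gguf   Q2_K
./llama-quantize mistral-7b-instruct-v0.2.f16.gguf mistral-7b-instruct-v0.2.Q3_K_M.gguf Q3_K_M
./llama-quantize mistral-7b-instruct-v0.2.f16.gguf mistral-7b-instruct-v0.2.Q4_K_M.gguf Q4_K_M
```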
Text-only (no vision)
Reduced precision on Q2_K for complex reasoning
Not ideal for high-precision math tasks
Base Model: Mistral AI
Quantization & Packaging: NovaForgeAI
Inference Engine: llama.cpp
Platform: Ollama
🔥 Fastest: Q2_K (low-RAM, instant replies)
⭐ Best Overall: Q3_K_M (recommended)
🧠 Best Quality: Q4_K_M (reasoning & coding)
✅ Q2 / Q3 / Q4 variants
🔜 Vision models
🔜 RAG-optimized builds
🔜 Long-context research models
🔜 NovaForgeAI Desktop integration
Privacy-first • CPU-optimized • Open Source