317 Downloads Updated 1 month ago
ollama run iliafed/nemotron-quant
5f0689c6794b · 24GB
Nemotron-H MoE 31B (Q4_K_M) — TurboQuant Benchmark
This model features TurboQuant (TQ) KV-cache compression. The benchmark below compares Normal mode (F16 KV cache) against TurboQuant (tbqp3/tbq3) to weigh memory savings against processing speed.
🚀 Quick Summary
Memory Efficiency: TurboQuant achieves 5.02x compression of the KV-cache, saving ~80.08% of VRAM.
Generation Speed: Minimal impact. TQ is only ~2.6% slower than native F16 during long generation sequences.
Prefill Performance: Significant trade-off on long contexts. Throughput drops by ~23.88% at 2k context and up to ~83.83% at 32k context.
Best Use Case: Massive context windows where VRAM is the primary bottleneck.
📊 Performance Comparison

| Metric | Normal (F16 KV) | TurboQuant (tbqp3/tbq3) | Difference |
|---|---|---|---|
| KV cache @ 262k ctx | 1,536 MiB | 306 MiB | -80.08% VRAM |
| Generation (1024 t) | 118.63 tok/s | 115.54 tok/s | -2.60% |
| Prefill (2048 t) | 2,235.64 tok/s | 1,701.87 tok/s | -23.88% |
| Prefill (32768 t) | 2,399.69 tok/s | 388.09 tok/s | -83.83% |

🔍 Technical Insights
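The headline percentages can be sanity-checked directly from the raw table values; a quick sketch with awk (numbers copied from the table, rounded to two decimals):

```shell
# Derive the "Difference" column from the raw Normal vs. TurboQuant measurements
awk 'BEGIN {
  printf "KV compression: %.2fx (%.2f%% VRAM saved)\n", 1536/306, (1 - 306/1536) * 100
  printf "Generation slowdown: %.2f%%\n", (1 - 115.54/118.63) * 100
  printf "Prefill drop @ 2k:  %.2f%%\n", (1 - 1701.87/2235.64) * 100
  printf "Prefill drop @ 32k: %.2f%%\n", (1 - 388.09/2399.69) * 100
}'
```

This reproduces the 5.02x / 80.08% compression figure and the -2.60%, -23.88%, and -83.83% speed deltas quoted above.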
Architecture Note: The Nemotron-H MoE 31B uses only 6 KV/attention layers. Because of this specific architecture, the massive reduction in memory bandwidth requirements does not translate into a speed boost; instead, the computational overhead of quantization/dequantization leads to a slowdown in prefill.
Auto-Mapping: The engine automatically maps the shorthand types to head_dim-specific variants: tbqp3_1 and tbq3_1 (for head_dim=128).
Hardware Tested: Dual GPU setup (NVIDIA RTX 3090 24GB + RTX 2080 Ti 11GB) on Windows 11.
🛠️ How to use
To reproduce the "Normal" baseline results, run llama-cli with `-ctk f16 -ctv f16`
To enable the "TurboQuant" memory-saving mode, run llama-cli with `-ctk tbqp3 -ctv tbq3`
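Put together, a full invocation might look like the following sketch. The model filename, context size, and prompt are illustrative placeholders, not from this page; `-ctk`/`-ctv` select the K and V cache types as described above, and the TurboQuant types require a llama.cpp build with TQ support:

```shell
# Baseline: F16 KV cache (model path, context size, and prompt are illustrative)
llama-cli -m nemotron-h-moe-31b-q4_k_m.gguf -c 32768 \
  -ctk f16 -ctv f16 -p "Summarize the following document:"

# TurboQuant: ~5x smaller KV cache at the same context size
llama-cli -m nemotron-h-moe-31b-q4_k_m.gguf -c 32768 \
  -ctk tbqp3 -ctv tbq3 -p "Summarize the following document:"
```

At 32k context the TurboQuant run trades prefill throughput for the VRAM headroom shown in the table above.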
💡 Practical Conclusion
Choose Normal Mode for maximum speed during short-to-medium context tasks.
Choose TurboQuant when you need to fit extremely large contexts (up to 256k+) into limited VRAM where F16 would otherwise cause an OOM (Out of Memory) error.