317 Downloads Updated 1 month ago
ollama run iliafed/nemotron-quant
5f0689c6794b · 24GB
Nemotron-H MoE 31B (Q4_K_M) — TurboQuant Benchmark
This model features TurboQuant (TQ) KV-cache compression. The benchmark below compares Normal mode (F16 KV cache) against TurboQuant (tbqp3/tbq3) to weigh memory savings against processing speed.
🚀 Quick Summary
Memory Efficiency: TurboQuant achieves 5.02x compression of the KV-cache, saving ~80.08% of VRAM.
Generation Speed: Minimal impact. TQ is only ~2.6% slower than native F16 during long generation sequences.
Prefill Performance: Significant trade-off on long contexts. Throughput drops by ~23.88% at 2k context and up to ~83.83% at 32k context.
Best Use Case: Massive context windows where VRAM is the primary bottleneck.
📊 Performance Comparison

| Metric | Normal (F16 KV) | TurboQuant (tbqp3/tbq3) | Difference |
|---|---|---|---|
| KV cache @ 262k ctx | 1,536 MiB | 306 MiB | -80.08% VRAM |
| Generation (1024 t) | 118.63 tok/s | 115.54 tok/s | -2.60% |
| Prefill (2048 t) | 2,235.64 tok/s | 1,701.87 tok/s | -23.88% |
| Prefill (32768 t) | 2,399.69 tok/s | 388.09 tok/s | -83.83% |

🔍 Technical Insights
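The headline percentages can be sanity-checked directly from the raw table values; a quick sketch with awk (numbers copied from the table, rounded to two decimals):

```shell
# Derive the "Difference" column from the raw Normal vs. TurboQuant measurements
awk 'BEGIN {
  printf "KV compression: %.2fx (%.2f%% VRAM saved)\n", 1536/306, (1 - 306/1536) * 100
  printf "Generation slowdown: %.2f%%\n", (1 - 115.54/118.63) * 100
  printf "Prefill drop @ 2k:  %.2f%%\n", (1 - 1701.87/2235.64) * 100
  printf "Prefill drop @ 32k: %.2f%%\n", (1 - 388.09/2399.69) * 100
}'
```

This reproduces the 5.02x / 80.08% compression figure and the -2.60%, -23.88%, and -83.83% speed deltas quoted above.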
Architecture Note: The Nemotron-H MoE 31B uses only 6 KV/attention layers. Because of this specific architecture, the massive reduction in memory bandwidth requirements does not translate into a speed boost; instead, the computational overhead of quantization/dequantization leads to a slowdown in prefill.
Auto-Mapping: The engine automatically maps the shorthand types to head_dim-specific variants: tbqp3_1 and tbq3_1 (for head_dim=128).
Hardware Tested: Dual GPU setup (NVIDIA RTX 3090 24GB + RTX 2080 Ti 11GB) on Windows 11.
🛠️ How to use
To reproduce the "Normal" baseline results, run llama-cli with `-ctk f16 -ctv f16`
To enable the "TurboQuant" memory-saving mode, run llama-cli with `-ctk tbqp3 -ctv tbq3`
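Put together, a full invocation might look like the following sketch. The model filename, context size, and prompt are illustrative placeholders, not from this page; `-ctk`/`-ctv` select the K and V cache types as described above, and the TurboQuant types require a llama.cpp build with TQ support:

```shell
# Baseline: F16 KV cache (model path, context size, and prompt are illustrative)
llama-cli -m nemotron-h-moe-31b-q4_k_m.gguf -c 32768 \
  -ctk f16 -ctv f16 -p "Summarize the following document:"

# TurboQuant: ~5x smaller KV cache at the same context size
llama-cli -m nemotron-h-moe-31b-q4_k_m.gguf -c 32768 \
  -ctk tbqp3 -ctv tbq3 -p "Summarize the following document:"
```

At 32k context the TurboQuant run trades prefill throughput for the VRAM headroom shown in the table above.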
💡 Practical Conclusion
Choose Normal Mode for maximum speed during short-to-medium context tasks.
Choose TurboQuant when you need to fit extremely large contexts (up to 256k+) into limited VRAM where F16 would otherwise cause an OOM (Out of Memory) error.