
NVIDIA Nemotron 3 Super is a 120B-parameter open mixture-of-experts (MoE) model that activates only 12B parameters per token, delivering high compute efficiency and accuracy for complex multi-agent applications.

ollama run nemotron-3-super
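
The model can also be called programmatically. A minimal sketch using the official ollama Python client (assumes `pip install ollama` and a local Ollama server with the model already pulled; everything else is a standard chat call):

```python
# Minimal sketch: chatting with nemotron-3-super through the ollama Python client.
import ollama

response = ollama.chat(
    model="nemotron-3-super",
    messages=[{"role": "user", "content": "Summarize what a mixture-of-experts model is."}],
)
print(response.message.content)
```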

Applications

Claude Code: ollama launch claude --model nemotron-3-super
Codex: ollama launch codex --model nemotron-3-super
OpenCode: ollama launch opencode --model nemotron-3-super
OpenClaw: ollama launch openclaw --model nemotron-3-super


Readme

NVIDIA Nemotron™ is a family of open models with open weights, training data, and recipes, delivering leading efficiency and accuracy for building specialized AI agents.

Nemotron-3-Super is a large language model (LLM) trained by NVIDIA, designed to deliver strong agentic, reasoning, and conversational capabilities. It is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Like other models in the family, it responds to user queries and tasks by first generating a reasoning trace and then concluding with a final response. The model’s reasoning capabilities can be configured through a flag in the chat template.
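
The card does not name the chat-template flag that controls reasoning. In the Ollama API, reasoning models typically expose this as the `think` option, so the following is a sketch assuming that mapping holds for this model:

```python
# Sketch: toggling the reasoning trace via Ollama's `think` option.
# The underlying chat-template flag is not named in the card; `think` is
# how Ollama commonly surfaces this for reasoning models (assumption).
import ollama

messages = [{"role": "user", "content": "How many weekdays are there in March 2025?"}]

# Reasoning on: the reply carries a separate reasoning trace plus a final answer.
on = ollama.chat(model="nemotron-3-super", messages=messages, think=True)
print("trace:", on.message.thinking)
print("answer:", on.message.content)

# Reasoning off: the model answers directly.
off = ollama.chat(model="nemotron-3-super", messages=messages, think=False)
print("answer:", off.message.content)
```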

The model has 12B active parameters and 120B parameters in total.

Supported languages include English, French, German, Italian, Japanese, Spanish, and Chinese.

This model is ready for commercial use.
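
Given the agentic focus (including workloads like IT ticket automation), a tool-calling sketch may help. `lookup_ticket` is a hypothetical stub invented here for illustration; the flow follows the ollama-python tool-calling examples, so details like the `tool_name` field should be checked against your client version:

```python
# Sketch: a single tool-call round trip for an agentic workflow.
# `lookup_ticket` is a hypothetical helper, not part of the model or Ollama.
import ollama

def lookup_ticket(ticket_id: str) -> str:
    """Return the status of an IT ticket (stubbed for the example)."""
    return f"Ticket {ticket_id}: open, assigned to network team."

messages = [{"role": "user", "content": "What's the status of ticket INC-4021?"}]
response = ollama.chat(model="nemotron-3-super", messages=messages, tools=[lookup_ticket])

# Execute any requested tool calls and feed the results back to the model.
messages.append(response.message)
for call in response.message.tool_calls or []:
    result = lookup_ticket(**call.function.arguments)
    messages.append({"role": "tool", "content": result, "tool_name": "lookup_ticket"})

final = ollama.chat(model="nemotron-3-super", messages=messages)
print(final.message.content)
```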


Benchmarks

| Benchmark | Nemotron-3-Super | Nemotron-3-Super FP8 | Nemotron-3-Super NVFP4 |
|---|---|---|---|
| **General Knowledge** | | | |
| MMLU-Pro | 83.73 | 83.63 | 83.33 |
| **Reasoning** | | | |
| HMMT Feb25 (with tools) | 94.73 | 94.38 | 95.36 |
| GPQA (no tools) | 79.23 | 79.36 | 79.42 |
| LiveCodeBench (v6, 2024-08 to 2025-05) | 78.69 | 78.44 | 78.44 |
| LiveCodeBench (v5, 2024-07 to 2024-12) | 81.19 | 80.99 | 80.56 |
| SciCode (subtask) | 42.05 | 41.38 | 40.83 |
| HLE (no tools) | 18.26 | 17.42 | 17.42 |
| **Agentic** | | | |
| Terminal Bench (hard subset) | 25.78 | 26.04 | 24.48 |
| TauBench V2: Airline | 56.25 | 56.25 | 54.75 |
| TauBench V2: Retail | 62.83 | 63.05 | 63.38 |
| TauBench V2: Telecom | 64.36 | 63.93 | 63.27 |
| TauBench V2: Average | 61.15 | 61.07 | 60.46 |
| **Chat & Instruction Following** | | | |
| IFBench (prompt) | 72.58 | 72.32 | 73.30 |
| Scale AI Multi-Challenge | 55.23 | 54.35 | 52.80 |
| Arena-Hard-V2 (Hard Prompt) | 73.88 | 76.06 | 76.00 |
| **Long Context** | | | |
| AA-LCR | 58.31 | 57.69 | 58.06 |
| RULER-500 @ 128k (500 samples per task) | 96.79 | 96.85 | 95.99 |
| RULER-500 @ 256k (500 samples per task) | 96.60 | 96.33 | 96.52 |
| RULER-500 @ 512k (500 samples per task) | 96.09 | 95.66 | 96.23 |
| **Multilingual** | | | |
| MMLU-ProX (avg over languages) | 79.35 | 79.21 | 79.37 |
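
The RULER rows above are measured at 128k to 512k tokens, but Ollama's default context window is much shorter, so long-context use means raising the standard `num_ctx` option. A sketch; the 512k value mirrors the largest benchmark setting above and assumes the deployed model and available memory actually support it, and `incident_log.txt` is a hypothetical input file:

```python
# Sketch: requesting a larger context window via the `num_ctx` option.
# Whether a 512k window fits depends on memory and the model's real
# context limit (assumed here from the RULER table).
import ollama

long_document = open("incident_log.txt").read()  # hypothetical input file

response = ollama.chat(
    model="nemotron-3-super",
    messages=[{"role": "user", "content": f"Summarize the recurring failures:\n\n{long_document}"}],
    options={"num_ctx": 524288},  # 512k tokens, matching the largest RULER setting
)
print(response.message.content)
```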