NVIDIA Nemotron 3 Super: Pricing, benchmarks, and what 12B active parameters actually gets you
NVIDIA just shipped a 120B-parameter open-source model at GTC that only fires 12B parameters per token. On DeepInfra it runs at $0.10 per million input tokens. That's 25x cheaper than GPT-5.4. Here's the catch, and whether it matters for what you're building.

Image source: NVIDIA Blog
TL;DR
- **Model:** nemotron-3-super-120b-a12b, released March 11, 2026 at GTC.
- **Pricing:** $0.30 / 1M input, $0.80 / 1M output (median across providers). As low as $0.10 / $0.50 on DeepInfra.
- **Architecture:** Hybrid Mamba-2 + Transformer MoE. 120B total params, ~12.7B active per token. 1M context window.
- **Speed:** 449 output tokens/sec per Artificial Analysis. Ranked #1 in its tier. 2.2x faster than GPT-OSS-120B.
- **Open source:** Permissive license. Free weights on Hugging Face. Self-host or use via API.
What is Nemotron 3 Super?
NVIDIA announced Nemotron 3 Super at GTC on March 11, 2026. It's their biggest open-source model, and it's aimed squarely at agentic workloads: multi-step reasoning, tool use, long-context tasks where you need the model to be fast and cheap, not necessarily the smartest thing in the room.
The interesting bit is the architecture. It has 120 billion total parameters, but only about 12.7 billion fire on any given token. That's a Mixture of Experts (MoE) design. The backbone is a hybrid of Mamba-2 layers (state space models with linear-time complexity) and Transformer attention layers interleaved at specific depths for long-context retrieval.
In practice, this means it runs at 449 output tokens per second according to Artificial Analysis, which ranks it #1 of 51 models in its intelligence tier. It holds a 1M-token context window with 91.75% accuracy on the RULER benchmark at full length. Most comparable models fall apart well before that.
Pricing breakdown
Nemotron 3 Super is open-weight, so pricing varies by provider. The weights themselves are free. You're paying for inference compute, and the spread between providers is pretty wide:
| Provider | Input / 1M tokens | Output / 1M tokens |
|---|---|---|
| DeepInfra | $0.10 | $0.50 |
| Fireworks AI | $0.20 | $0.60 |
| Nebius | $0.30 | $0.80 |
| Together AI | $0.30 | $0.80 |
| OpenRouter (free tier) | $0.00 | $0.00 |
| NVIDIA NIM (self-host) | Your GPU cost | Your GPU cost |
Median across providers is $0.30 input / $0.80 output per million tokens. DeepInfra is the cheapest hosted option at $0.10 / $0.50. If you want to self-host, you'll need 8x H100-80GB GPUs at BF16 precision, or fewer on Blackwell hardware with native NVFP4.
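As a weights-only sanity check on those hardware numbers, here's a rough sketch. It counts memory for the weights alone; real deployments also need room for KV cache, Mamba state, and activations at 1M-token context, which is why NVIDIA's 8x H100 figure is higher than the bare minimum this computes:

```python
# Rough VRAM estimate for self-hosting. Weights-only lower bound; actual
# deployments need additional memory for KV cache, SSM state, and activations.

def weight_memory_gb(total_params_b: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB."""
    return total_params_b * 1e9 * bytes_per_param / 1e9

for precision, nbytes in [("BF16", 2.0), ("FP8", 1.0), ("NVFP4", 0.5)]:
    gb = weight_memory_gb(120, nbytes)
    gpus = -(-gb // 80)  # ceiling division against 80 GB cards
    print(f"{precision}: {gb:.0f} GB of weights -> at least {int(gpus)}x 80GB GPUs (weights only)")
```

At BF16 the weights alone are 240 GB; the recommended 8x H100 (640 GB) leaves the rest for long-context inference state. NVFP4 halves that again, which is the Blackwell self-hosting story.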
How that stacks up against GPT-5.4, Claude, and Gemini
This is where the numbers get ridiculous. At DeepInfra rates, Nemotron 3 Super is 25x cheaper on input than GPT-5.4. Compared to Claude Opus 4.6, it's 150x cheaper on input. Even at the median provider price, it's still 8x under GPT-5.4.
| Model | Input / 1M | Output / 1M | Context | Open source |
|---|---|---|---|---|
| Nemotron 3 Super (DeepInfra) | $0.10 | $0.50 | 1M | Yes |
| Nemotron 3 Super (median) | $0.30 | $0.80 | 1M | Yes |
| GPT-5.4 | $2.50 | $20.00 | 1M+ | No |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M | No |
| Claude Opus 4.6 | $15.00 | $75.00 | 1M | No |
| Gemini 3.1 Pro | $2.00 | $10.00 | 1M | No |
| Grok 4.20 | $2.00 | $6.00 | 2M | No |
The obvious question: is there a proportional quality gap? There is a gap, but how much it costs you depends entirely on what you're doing. You can see the full spec sheet on our Nemotron 3 Super model page.
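The cost multiples fall straight out of the table. A quick script to reproduce them, using the USD-per-1M-input-token figures listed above:

```python
# Input-price multiples relative to Nemotron 3 Super, reproducing the
# 25x / 150x / ~8x claims. Prices are USD per 1M input tokens from the table.

input_price = {
    "Nemotron (DeepInfra)": 0.10,
    "Nemotron (median)": 0.30,
    "GPT-5.4": 2.50,
    "Claude Opus 4.6": 15.00,
}

def multiple(expensive: str, cheap: str) -> float:
    """How many times pricier the first model's input tokens are."""
    return input_price[expensive] / input_price[cheap]

print(round(multiple("GPT-5.4", "Nemotron (DeepInfra)")))          # 25
print(round(multiple("Claude Opus 4.6", "Nemotron (DeepInfra)")))  # 150
print(round(multiple("GPT-5.4", "Nemotron (median)"), 1))          # 8.3
```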
Benchmark numbers that matter
I want to be upfront about this: Nemotron 3 Super is not competing with GPT-5.4 or Claude Opus 4.6 on raw intelligence. Its Artificial Analysis Intelligence Index score is 36, compared to 57 for both GPT-5.4 and Gemini 3.1 Pro. That's a real gap. But the benchmark breakdown tells a more nuanced story.
| Benchmark | Nemotron 3 Super | GPT-OSS-120B | Qwen3.5-122B |
|---|---|---|---|
| SWE-Bench Verified | 60.47% | 41.90% | ~66% |
| RULER (1M tokens) | 91.75% | 22.30% | — |
| RULER (256K tokens) | 96.30% | — | — |
| MMLU-Pro | 83.73 | — | 86.70 |
| AIME 2025 | 90.21% | — | — |
| LiveCodeBench v5 | 81.19% | — | — |
| Arena-Hard V2 | 73.88% | 90.26% | — |
Two things jump out. SWE-Bench Verified at 60.47% is legitimately strong for an open-source model. Nemotron beats GPT-OSS by nearly 19 points here. Qwen3.5-122B still leads at ~66%, but Qwen is also significantly slower.
The RULER numbers are where the Mamba-Transformer hybrid really earns its keep. At 1M tokens, Nemotron holds 91.75% accuracy. GPT-OSS drops to 22.30% at the same length. That's not a subtle difference. If you're processing long documents or feeding agents large context windows, this matters.
Where it falls short: Arena-Hard V2. At 73.88% vs GPT-OSS's 90.26%, Nemotron is clearly worse at conversation. NVIDIA says as much in their technical report. This model is built for agentic execution, not for chatting with users. If you need a model that's pleasant to talk to, look elsewhere.
Speed is the real selling point
449 output tokens per second. That's the Artificial Analysis measurement, and it makes Nemotron the fastest model in its intelligence tier, ranked #1 of 51 models. NVIDIA claims 2.2x the throughput of GPT-OSS-120B and over 5x throughput compared to the previous Nemotron Super.
For agentic workloads, this adds up fast. If you're chaining four sequential model calls and each one runs 2x faster, the whole pipeline finishes in half the time, and the absolute seconds saved grow with every call you add to the chain. When you're running multi-agent systems that make dozens of calls per task, the speed difference between Nemotron and a slower model changes what you can realistically build.
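A back-of-the-envelope sketch of what that means for a chained pipeline. The call count and tokens-per-call are made-up illustration values; 449 tok/s is the Artificial Analysis measurement, and 2.2x is NVIDIA's claimed gap to GPT-OSS-120B:

```python
# Sequential pipeline latency under different per-call throughputs.
# Call count and per-call token budget are illustrative, not measured.

def pipeline_seconds(calls: int, tokens_per_call: int, tokens_per_sec: float) -> float:
    """Wall-clock time for a chain of sequential generation calls."""
    return calls * tokens_per_call / tokens_per_sec

fast = pipeline_seconds(calls=12, tokens_per_call=800, tokens_per_sec=449)
slow = pipeline_seconds(calls=12, tokens_per_call=800, tokens_per_sec=449 / 2.2)

print(f"Nemotron-speed loop: {fast:.1f}s")
print(f"GPT-OSS-speed loop:  {slow:.1f}s ({slow - fast:.1f}s saved per task)")
```

The ratio between the two pipelines stays 2.2x no matter how many calls you chain; what grows is the absolute time saved per task, which is what users feel.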
Why it's this fast: four mechanisms working together. Mamba-2 layers process sequences in linear time instead of quadratic. The MoE architecture means only ~12.7B of the 120B parameters activate per token. NVIDIA pretrained with native NVFP4 precision on Blackwell, so B200 inference is 4x faster than FP8 on H100s. And Multi-Token Prediction (MTP) layers let it draft several tokens per step instead of one.
The hybrid architecture, briefly
Most frontier models are pure Transformers. Nemotron 3 Super isn't. It interleaves Mamba-2 layers with Transformer attention layers and wraps them in a Latent MoE routing scheme.
Mamba-2 layers are state space models. They process sequences in linear time relative to length, which is how the model handles 1M tokens without falling apart. The attention layers are placed at specific depths to handle retrieval: when you need the model to find a specific fact buried deep in a long context, that's the attention layers doing the work.
The Latent MoE part is what keeps inference cheap. Out of 120B total parameters, only ~12.7B are active for any given token. The routing layer decides which experts to activate, which is why you get 120B-class capability without 120B-class compute costs.
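A toy top-k router illustrates the idea. Everything here is invented for illustration; the expert count, k, and logits are not Nemotron's published configuration, and the 8-of-64 active fraction only loosely mirrors the 120B-total / ~12.7B-active ratio:

```python
import math

# Toy top-k MoE router: a gating network scores all experts, but only the
# top k actually run, so most parameters stay idle for any given token.

def route(logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the top-k experts and renormalize their softmax weights."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]          # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    topk = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    z = sum(probs[i] for i in topk)
    return [(i, probs[i] / z) for i in topk]          # weights sum to 1.0

# 64 hypothetical experts, 8 active: the token's compute touches only a
# fraction of the total expert parameters.
weights = route(logits=[0.1 * i for i in range(64)], k=8)
print(weights)  # 8 (expert_index, weight) pairs
```

The key property: compute cost scales with k, not with the total expert count, which is how a 120B model runs at roughly 12B-model cost per token.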
Where to access it
It's already available on most major inference providers. A few options depending on what you need:
- Hosted API: DeepInfra is the cheapest, with Fireworks AI, Nebius, and Together AI also serving it.
- Free experimentation: OpenRouter offers a free tier.
- Self-hosting: grab the weights from Hugging Face, or deploy with NVIDIA NIM.
What this actually costs in practice
Take DeepInfra rates ($0.10 input / $0.50 output per 1M tokens) against GPT-5.4 ($2.50 / $20.00): input is 25x cheaper and output is 40x cheaper, so most realistic workloads land somewhere around 30x overall.
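A sketch of the math at a few hypothetical monthly workloads. The token volumes are made-up profiles, not measured usage; the rates are the ones quoted above:

```python
# Monthly cost comparison at quoted rates. The three workload profiles
# (monthly input/output token volumes, in millions) are illustrative.

RATES = {  # USD per 1M tokens: (input, output)
    "Nemotron (DeepInfra)": (0.10, 0.50),
    "GPT-5.4": (2.50, 20.00),
}

def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    """Total monthly bill given millions of input and output tokens."""
    inp, out = RATES[model]
    return input_m * inp + output_m * out

scenarios = {
    "RAG chatbot":         (50, 10),     # 50M in / 10M out per month
    "agent pipeline":      (400, 120),
    "batch summarization": (2000, 300),
}

for name, (i, o) in scenarios.items():
    cheap = monthly_cost("Nemotron (DeepInfra)", i, o)
    pricey = monthly_cost("GPT-5.4", i, o)
    print(f"{name}: ${cheap:,.2f} vs ${pricey:,.2f} ({pricey / cheap:.0f}x)")
```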
The savings are real, but they assume Nemotron's quality is sufficient for your task. For coding and long-context work, the benchmarks suggest it often is. For general reasoning and conversation, it's not.
When to use it (and when to skip it)
Honest answer: Nemotron 3 Super is a specialist. It's fast and cheap and open-source, but it's not trying to be the best at everything. Here's my read on where it fits and where it doesn't.

**Use it for:**
- Multi-agent pipelines where latency compounds
- Coding tasks (60.47% SWE-Bench Verified)
- Long-context processing up to 1M tokens
- High-volume batch jobs where cost is the bottleneck
- Self-hosted setups where you need open weights

**Skip it for:**
- User-facing chat (73.88% Arena-Hard is weak)
- Tasks that need frontier-level reasoning
- Anything where GPT-5.4 or Claude quality is non-negotiable
- Multimodal inputs (text only, no images or audio)
Gun to my head, I'd say the sweet spot is using Nemotron as the workhorse in a pipeline where a smarter model handles the final output. Let Nemotron do the heavy lifting on retrieval, code generation, and tool use. Then route the result to Claude or GPT for anything user-facing. You get 80% of the work done at 3% of the cost.
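That workhorse-plus-finisher split can be sketched as a two-tier pipeline. Everything here is hypothetical scaffolding: `call_model` is a placeholder for whatever client SDK you actually use, and the model names are just strings:

```python
# Two-tier routing sketch: cheap fast model for intermediate steps, premium
# model only for the final user-facing pass. `call_model` is a stand-in,
# not a real SDK; substitute your provider's client.

def call_model(model: str, prompt: str) -> str:
    # Placeholder implementation; replace with a real API call.
    return f"[{model}] {prompt[:50]}"

def answer(question: str, documents: list[str]) -> str:
    # Heavy lifting (per-document extraction) goes to the cheap model.
    notes = [
        call_model("nemotron-3-super-120b-a12b",
                   f"Extract facts relevant to {question!r}:\n{doc}")
        for doc in documents
    ]
    # Only the polished, user-facing answer pays premium-model rates.
    return call_model("claude-opus-4.6",
                      f"Answer {question!r} from these notes:\n" + "\n".join(notes))
```

With N documents, N+1 calls hit the cheap model's rates and only one hits the expensive one, which is where the "80% of the work at 3% of the cost" intuition comes from.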
Worth knowing
Artificial Analysis flagged that Nemotron generated 110M output tokens during their Intelligence Index evaluation, versus a 7.3M average across other models. They describe it as "very verbose." That's worth keeping in mind for cost estimation. If the model tends to over-generate, your actual output token costs could be higher than you'd expect from the benchmark scores alone.
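You can fold that verbosity into your cost estimate directly. A sketch, using Artificial Analysis's measured token counts and DeepInfra's output rate:

```python
# Effective output price after adjusting for verbosity. Token counts are
# the Artificial Analysis eval measurements cited above; whether your
# workload sees the same over-generation is an open question.

nemotron_tokens = 110e6   # Nemotron's output tokens during the eval
average_tokens = 7.3e6    # average across other models on the same eval
verbosity = nemotron_tokens / average_tokens

sticker = 0.50            # DeepInfra, USD per 1M output tokens
effective = sticker * verbosity

print(f"~{verbosity:.1f}x verbosity -> effective output price ~${effective:.2f}/1M")
# Even at that multiplier, it stays well under GPT-5.4's $20.00/1M output rate.
```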
Also: the model is text-in, text-out only. No image, audio, or video inputs. If you need multimodal, this isn't the one.
Bottom line
Nemotron 3 Super is not going to replace GPT-5.4 or Claude for general-purpose work. It scores lower on intelligence benchmarks and it's noticeably weaker at conversation. That's not what it's for.
What it is: the fastest open-source model with a 1M context window that actually maintains accuracy at that length. At $0.10 per million input tokens on DeepInfra, it's 25-30x cheaper than GPT-5.4 for most workloads. For agentic pipelines, coding tasks, and long-context batch processing, those economics change what you can afford to build.
Check the full specs on the Nemotron 3 Super model page, see how it fits alongside 60+ other models on our pricing table, or plug in your own usage numbers with the cost calculator.
Sources
- NVIDIA Technical Blog: Introducing Nemotron 3 Super (March 11, 2026)
- NVIDIA Blog: Nemotron 3 Super Delivers 5x Higher Throughput (March 11, 2026)
- NVIDIA Research: Nemotron 3 Super Technical Report (model card and benchmarks)
- Artificial Analysis: Nemotron 3 Super 120B A12B (independent benchmarks, speed, and provider pricing)
- Hugging Face: Nemotron 3 Super Weights (open-source model download)
- NVIDIA NIM: Nemotron 3 Super Model Card (specs and deployment)