NVIDIA Nemotron 3 Super - blazing-fast agentic AI, ready to deploy today!

NVIDIA’s Nemotron 3 Super 120B A12B is an open-weight large language model released on March 11, 2026. It features 120B total parameters with only 12B active per forward pass, delivering exceptional compute efficiency for complex multi-agent applications such as software development and cybersecurity triage.
The model uses a hybrid Mamba2-Transformer LatentMoE architecture with Multi-Token Prediction (MTP), projecting tokens into a smaller latent dimension for expert routing and computation. This improves accuracy per byte and delivers over 5x throughput compared to the previous Nemotron Super generation. Notably, it is the first model in the Nemotron 3 family pre-trained using NVFP4 quantization — meaning it learned to be accurate within the constraints of 4-bit arithmetic from the first gradient update, not just at inference time.
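To make the routing idea concrete, here is a toy sketch of latent-space expert routing: tokens are projected down to a smaller latent dimension, routed to a few experts there, then projected back up. Every dimension, the expert count, and the routing rule below are illustrative assumptions, not Nemotron 3's actual internals.

```python
import numpy as np

# Toy dimensions; the real model's internal sizes are not disclosed here.
d_model, d_latent, n_experts, top_k = 512, 128, 8, 2

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02    # project token to latent space
W_up = rng.standard_normal((d_latent, d_model)) * 0.02      # project result back
router = rng.standard_normal((d_latent, n_experts)) * 0.02  # routing logits per expert
experts = rng.standard_normal((n_experts, d_latent, d_latent)) * 0.02

def latent_moe(x: np.ndarray) -> np.ndarray:
    """Route and compute in the smaller latent dimension, then project back."""
    z = x @ W_down                        # (d_latent,)
    logits = z @ router                   # one score per expert
    top = np.argsort(logits)[-top_k:]     # keep only the top-k experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                  # softmax over the selected experts
    out = sum(g * (z @ experts[e]) for g, e in zip(gates, top))
    return out @ W_up                     # back to the model dimension

print(latent_moe(rng.standard_normal(d_model)).shape)  # (512,)
```

Because routing and expert compute happen in the smaller latent dimension, each forward pass touches far fewer weights than a dense model of the same total size, which is where the efficiency claim comes from.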
Nemotron 3 Super supports a native 1 million token context window and responds to queries by first generating a reasoning trace before concluding with a final response, making it purpose-built for long-running autonomous agents and high-volume workloads such as IT ticket automation.
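Since the model always emits a reasoning trace before its final answer, client code typically needs to separate the two. The sketch below assumes a `<think>...</think>` delimiter, a convention borrowed from other reasoning models; the actual format may differ, so check your provider's response schema.

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a response into (reasoning trace, final answer).

    Assumes a <think>...</think> delimiter; adjust to the provider's actual format.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()  # no trace found: treat everything as the answer
    return m.group(1).strip(), text[m.end():].strip()

trace, answer = split_reasoning("<think>Check the VPN logs first.</think>Restart the gateway.")
print(trace)   # Check the VPN logs first.
print(answer)  # Restart the gateway.
```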
| Specification | Details |
|---|---|
| Architecture | Mamba2-Transformer Hybrid Latent Mixture of Experts (LatentMoE) with Multi-Token Prediction (MTP) |
| Total Parameters | 120 billion |
| Active Parameters | 12 billion (per inference pass) |
| Context Window | Up to 1 million tokens |
| Training Data | 25 trillion tokens |
| Supported Languages | English, French, German, Italian, Japanese, Spanish, Chinese, plus 43 programming languages |
| Pre-training Cutoff | June 2025 |
| Post-training Cutoff | February 2026 |
NVIDIA Nemotron 3 Super 120B is now available across multiple inference providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Provider | Why Notable | Blended Price ($/1M tokens) | Output Speed (tokens/s) | Latency (TTFT) | Max Context | Function Calling |
|---|---|---|---|---|---|---|
| DeepInfra | Best price + strong speed/latency balance; supports function calling | $0.20 | 459.3 | 1.01s | 262k | Yes |
| Baseten | Lowest latency (best TTFT) with near-top speed | $0.41 | 479.9 | 0.56s | 203k | No |
| Lightning AI | Fastest output speed (max throughput) | $0.45 | 498.6 | 1.46s | 256k | No |
| Nebius | High speed; supports JSON mode + function calling | $0.45 | 483.7 | 1.62s | 256k | Yes |
| Weights & Biases | Low latency; supports JSON mode + function calling; low throughput | $0.35 | 144.9 | 0.73s | 262k | Yes |
Based on benchmarks across 5 tracked providers, DeepInfra is the recommended API for production-scale Nemotron 3 Super deployment. At $0.20/1M tokens, it is 55% cheaper than the most expensive providers while delivering 459.3 t/s — within 8% of the fastest option. For the lowest latency, Baseten leads at 0.56s TTFT. For maximum raw throughput, Lightning AI leads at 498.6 t/s.
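For reference, a minimal quickstart against DeepInfra's OpenAI-compatible endpoint might look like the sketch below. The model ID is a placeholder; confirm the exact ID in DeepInfra's model catalog before use.

```python
from openai import OpenAI

# DeepInfra exposes an OpenAI-compatible API at this base URL.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="YOUR_DEEPINFRA_API_KEY",
)

resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # placeholder: check the catalog for the exact ID
    messages=[{"role": "user", "content": "Triage this IT ticket: users report VPN timeouts."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)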
DeepInfra secures the top spot on the economics of serving Nemotron 3 Super 120B, while remaining highly competitive on every other metric.
The cost delta, $0.25 per million tokens saved against the priciest providers and roughly $0.17 against the market mean, makes DeepInfra the clear default for production-scale deployments. For most RAG or chat applications, the difference between 498 t/s (Lightning AI) and 459 t/s (DeepInfra) is imperceptible, while the 55% cost advantage compounds significantly at volume.
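The arithmetic behind that claim, using the blended prices from the comparison table above (the monthly volume is an illustrative assumption):

```python
# Blended $/1M-token prices from the comparison table above.
prices = {"DeepInfra": 0.20, "Baseten": 0.41, "Lightning AI": 0.45,
          "Nebius": 0.45, "Weights & Biases": 0.35}

mean_price = sum(prices.values()) / len(prices)                 # ~$0.37 per 1M tokens
delta_vs_priciest = max(prices.values()) - prices["DeepInfra"]  # $0.25 per 1M tokens
delta_vs_mean = mean_price - prices["DeepInfra"]                # ~$0.17 per 1M tokens

monthly_tokens = 5_000_000_000  # assumption: a 5B-token/month workload
print(f"vs priciest: ${delta_vs_priciest * monthly_tokens / 1e6:,.0f} saved per month")
print(f"vs mean:     ${delta_vs_mean * monthly_tokens / 1e6:,.0f} saved per month")
```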
For applications requiring immediate feedback — such as voice-to-voice agents or highly responsive chat interfaces — Baseten is the technical leader.
Baseten’s 0.56s TTFT beats the closest competitor by 0.17s and delivers a genuinely real-time feel for end-users. However, its pricing ($0.41/1M) is more than double DeepInfra’s, and it lacks support for JSON Mode and Function Calling — limiting its viability for complex agentic workflows.
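If you want to verify TTFT against your own network path, a streaming request makes the measurement straightforward. The sketch below times the arrival of the first content token; the endpoint shown is DeepInfra's OpenAI-compatible one, and the model ID is a placeholder.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai",  # or any compatible provider
                api_key="YOUR_API_KEY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # placeholder model ID
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # The first chunk often carries only role metadata; wait for real content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.2f}s")
        break
```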
Lightning AI is purpose-built for generation speed, making it the natural choice for high-volume batch processing jobs.
Lightning AI’s 498.6 t/s is the fastest measured, but the 8.5% speed advantage over DeepInfra does not justify the 125% price premium for most use cases. Combined with the lack of JSON Mode and Function Calling, it is best reserved for offline batch workloads where raw throughput, not cost, is the binding constraint.
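A typical offline batch job in this mold fans out many requests concurrently and lets aggregate throughput, not per-request latency, set the pace. A minimal sketch, with the endpoint, model ID, and concurrency cap all as assumptions:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://api.deepinfra.com/v1/openai",  # any compatible endpoint
                     api_key="YOUR_API_KEY")
sem = asyncio.Semaphore(16)  # assumed concurrency cap; tune to the provider's rate limits

async def summarize(doc: str) -> str:
    async with sem:
        resp = await client.chat.completions.create(
            model="nvidia/Nemotron-3-Super-120B",  # placeholder model ID
            messages=[{"role": "user", "content": f"Summarize: {doc}"}],
            max_tokens=256,
        )
        return resp.choices[0].message.content

async def main(docs: list[str]) -> list[str]:
    return await asyncio.gather(*(summarize(d) for d in docs))

results = asyncio.run(main(["first document ...", "second document ..."]))
print(len(results), "summaries")
```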
Nebius occupies a specific niche for developers requiring both Function Calling and JSON Mode — the only provider besides Weights & Biases to support both.
Nebius is worth the premium ($0.45/1M) only if your application strictly requires both JSON Mode and Function Calling. It delivers solid throughput (483.7 t/s) but suffers from the highest latency in the benchmark (1.62s), making it unsuitable for real-time interfaces.
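For teams weighing Nebius for its tooling support, the function-calling flow uses the standard OpenAI-style `tools` parameter. In the sketch below, the base URL, model ID, and `create_ticket` tool are all placeholders, not provider-confirmed values.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://example-provider.com/v1",  # placeholder endpoint
                api_key="YOUR_API_KEY")

# Hypothetical tool for the IT-ticket use case described earlier.
tools = [{
    "type": "function",
    "function": {
        "name": "create_ticket",
        "description": "Open an IT support ticket",
        "parameters": {
            "type": "object",
            "properties": {
                "summary": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            },
            "required": ["summary", "priority"],
        },
    },
}]

resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B",  # placeholder model ID
    messages=[{"role": "user", "content": "The VPN is down for the whole office."}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]  # assumes the model chose to call the tool
print(call.function.name, json.loads(call.function.arguments))
```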
Weights & Biases presents an unusual performance profile, likely acting as a specialized developer-environment endpoint rather than a production inference backend.
Despite strong latency and full feature support, its throughput bottleneck (144.9 t/s) makes it unsuitable for production traffic. It is best suited for short-context developer testing and evaluation environments.
DeepInfra is the cheapest provider at $0.20 blended per 1M tokens — roughly 55% cheaper than Nebius and Lightning AI, and 50% cheaper than Baseten.
Baseten offers the fastest latency with a TTFT of 0.56s, making it ideal for real-time conversational applications.
DeepInfra supports Function Calling, making it suitable for agentic workflows. Lightning AI and Baseten currently do not support this feature.
Nebius is worth the premium ($0.45/1M) only if your application strictly requires both JSON Mode and Function Calling with no tolerance for prompt-engineering workarounds.
Nemotron 3 Super uses a unique hybrid Mamba2-Transformer LatentMoE architecture, enabling 120B total parameters with only 12B active per inference. This delivers over 5x throughput compared to the previous Nemotron Super, while supporting a native 1M-token context window for long-running autonomous agents.
For the vast majority of Nemotron 3 Super 120B deployments, DeepInfra is the recommended provider. It offers the market’s lowest price ($0.20/1M), strong throughput (459.3 t/s), viable latency (1.01s), and Function Calling support — all without the significant cost premium of the competition.