NVIDIA Nemotron 3 Super 120B API Benchmarks: Latency & Cost
Published on April 3, 2026 by DeepInfra

About NVIDIA Nemotron 3 Super 120B A12B

NVIDIA’s Nemotron 3 Super 120B A12B is an open-weight large language model released on March 11, 2026. It features 120B total parameters with only 12B active per forward pass, delivering exceptional compute efficiency for complex multi-agent applications such as software development and cybersecurity triaging.

The model uses a hybrid Mamba2-Transformer LatentMoE architecture with Multi-Token Prediction (MTP), projecting tokens into a smaller latent dimension for expert routing and computation. This improves accuracy per byte and delivers over 5x throughput compared to the previous Nemotron Super generation. Notably, it is the first model in the Nemotron 3 family pre-trained using NVFP4 quantization — meaning it learned to be accurate within the constraints of 4-bit arithmetic from the first gradient update, not just at inference time.

Nemotron 3 Super supports a native 1 million token context window and responds to queries by first generating a reasoning trace before concluding with a final response, making it purpose-built for long-running autonomous agents and high-volume workloads such as IT ticket automation.
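Because the model emits a reasoning trace before its final answer, client code usually needs to separate the two. A minimal sketch, assuming the trace is delimited by `<think>…</think>` tags (the delimiter is an assumption; check your provider's actual response format):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning_trace, final_answer).

    Assumes the reasoning trace is wrapped in <think>...</think>;
    if no trace is present, the whole text is treated as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    trace = match.group(1).strip()
    answer = text[match.end():].strip()
    return trace, answer

trace, answer = split_reasoning("<think>2+2=4</think>The answer is 4.")
```

In an agent loop you would typically log or discard the trace and pass only the final answer downstream.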

| Specification | Details |
| --- | --- |
| Architecture | Mamba2-Transformer Hybrid Latent Mixture of Experts (LatentMoE) with Multi-Token Prediction (MTP) |
| Total Parameters | 120 billion |
| Active Parameters | 12 billion (per inference pass) |
| Context Window | Up to 1 million tokens |
| Training Data | 25 trillion tokens |
| Supported Languages | English, French, German, Italian, Japanese, Spanish, Chinese, plus 43 programming languages |
| Pre-training Cutoff | June 2025 |
| Post-training Cutoff | February 2026 |

NVIDIA Nemotron 3 Super 120B is now available across multiple inference providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

NVIDIA Nemotron 3 Super 120B API Review Summary

  • DeepInfra is the lowest-cost provider at $0.20 / 1M tokens (blended) — approximately 2.25x cheaper than the highest-priced options (Nebius and Lightning AI at $0.45).
  • DeepInfra delivers strong throughput: 459.3 tokens/sec, within 8% of the fastest provider (Lightning AI at 498.6 t/s).
  • DeepInfra is competitive on interactivity: 1.01s TTFT (3rd best), behind Baseten (0.56s) and Weights & Biases (0.73s).
  • Provider performance varies widely: output speed ranges from 498.6 t/s to 144.9 t/s — a ~3.4x spread — so provider choice materially impacts UX and throughput.
  • API feature coverage is mixed: function calling is supported by 3 of 5 providers (Weights & Biases, DeepInfra, Nebius); JSON mode by 2 of 5 (Weights & Biases, Nebius).
  • Benchmarks reflect sustained performance: median (P50) over the past 72 hours using a 10,000 input-token workload.
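The summary metrics combine in a simple way for interactive use: perceived response time is roughly TTFT plus output length divided by output speed. A quick sketch using the benchmark figures above (the 500-token completion length is an arbitrary example, not part of the benchmark):

```python
def response_time_s(ttft_s: float, speed_tps: float, output_tokens: int) -> float:
    """Estimate wall-clock time for one completion: time to first token
    plus streaming time for the generated tokens."""
    return ttft_s + output_tokens / speed_tps

# TTFT (s) and output speed (t/s) from the benchmark table.
providers = {
    "DeepInfra":    (1.01, 459.3),
    "Baseten":      (0.56, 479.9),
    "Lightning AI": (1.46, 498.6),
}
for name, (ttft, speed) in providers.items():
    print(f"{name}: {response_time_s(ttft, speed, 500):.2f}s")
```

For short completions, TTFT dominates, which is why Baseten feels fastest interactively even though Lightning AI has the higher raw throughput.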

NVIDIA Nemotron 3 Super 120B — Best APIs

| Provider | Why Notable | Blended ($/1M) | Speed (t/s) | Latency (TTFT) | Context | Tools |
| --- | --- | --- | --- | --- | --- | --- |
| DeepInfra | Best price + strong speed/latency balance; supports function calling | $0.20 | 459.3 | 1.01s | 262k | Yes |
| Baseten | Lowest latency (best TTFT) with near-top speed | $0.41 | 479.9 | 0.56s | 203k | No |
| Lightning AI | Fastest output speed (max throughput) | $0.45 | 498.6 | 1.46s | 256k | No |
| Nebius | High speed; supports JSON mode + function calling | $0.45 | 483.7 | 1.62s | 256k | Yes |
| Weights & Biases | Low latency; supports JSON mode + function calling; low throughput | $0.35 | 144.9 | 0.73s | 262k | Yes |

Quick Verdict: Which Nemotron 3 Super Provider is Best?

Based on benchmarks across 5 tracked providers, DeepInfra is the recommended API for production-scale Nemotron 3 Super deployment. At $0.20/1M tokens, it is 55% cheaper than the most expensive providers while delivering 459.3 t/s — within 8% of the fastest option. For the lowest latency, Baseten leads at 0.56s TTFT. For maximum raw throughput, Lightning AI leads at 498.6 t/s.

Overall Winner: DeepInfra

DeepInfra secures the top spot by leading on the economics of serving Nemotron 3 Super 120B while remaining highly competitive on every other metric.

  • Input Price: $0.10 / 1M tokens
  • Output Price: $0.50 / 1M tokens
  • Blended Price: $0.20 / 1M tokens (cheapest on the market)
  • Output Speed: 459.3 t/s (within 8% of the fastest provider)
  • Latency (TTFT): 1.01s (3rd best overall)
  • Context Window: 262k tokens
  • API Features: Function Calling supported

The cost delta of $0.25 per million tokens versus the most expensive providers makes DeepInfra the clear choice for production-scale deployments. For most RAG or chat applications, the difference between 498 t/s (Lightning AI) and 459 t/s (DeepInfra) is imperceptible, while the 55% cost advantage compounds significantly at volume.
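As a sanity check on the blended figure: blended $/1M is a weighted mix of input and output prices. Assuming the common 3:1 input-to-output token weighting (the exact weighting behind these benchmarks is an assumption), DeepInfra's listed prices reproduce the $0.20 number:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_ratio: float = 3.0) -> float:
    """Blended $/1M tokens assuming input_ratio input tokens per output
    token (3:1 is a common convention; the weighting is an assumption)."""
    return (input_ratio * input_per_m + output_per_m) / (input_ratio + 1)

# DeepInfra: $0.10 input, $0.50 output per 1M tokens.
deepinfra = blended_price(0.10, 0.50)
# Monthly savings vs. a $0.45/1M provider at 10B blended tokens/month
# (10B tokens = 10,000 units of 1M tokens; volume is illustrative).
savings = (0.45 - deepinfra) * 10_000
```

At that illustrative volume the $0.25/1M delta is roughly $2,500 per month, which is where the "compounds at volume" argument comes from.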

Best for Low Latency: Baseten

For applications requiring immediate feedback — such as voice-to-voice agents or highly responsive chat interfaces — Baseten is the technical leader.

  • Latency (TTFT): 0.56s (fastest in the benchmark)
  • Output Speed: 479.9 t/s (#3 overall)
  • Blended Price: $0.41 / 1M tokens
  • API Features: No JSON Mode or Function Calling

Baseten’s 0.56s TTFT beats the closest competitor by 0.17s and delivers a genuinely real-time feel for end-users. However, its pricing ($0.41/1M) is more than double DeepInfra’s, and it lacks support for JSON Mode and Function Calling — limiting its viability for complex agentic workflows.

Best for Raw Throughput: Lightning AI

Lightning AI is purpose-built for generation speed, making it the natural choice for high-volume batch processing jobs.

  • Output Speed: 498.6 t/s (fastest in the benchmark)
  • Latency (TTFT): 1.46s (one of the slower starts)
  • Blended Price: $0.45 / 1M tokens (tied most expensive)
  • API Features: No JSON Mode or Function Calling

Lightning AI’s 498.6 t/s is the fastest measured, but the 8.5% speed advantage over DeepInfra does not justify the 125% price premium for most use cases. Combined with the lack of JSON Mode and Function Calling, it is best reserved for offline batch workloads where cost is not a constraint.

Feature-Rich Alternative: Nebius

Nebius occupies a specific niche for developers requiring both Function Calling and JSON Mode — the only provider besides Weights & Biases to support both.

  • Output Speed: 483.7 t/s (#2 overall)
  • Latency (TTFT): 1.62s (highest in the benchmark)
  • Blended Price: $0.45 / 1M tokens (tied most expensive)
  • API Features: JSON Mode + Function Calling

Nebius is worth the premium ($0.45/1M) only if your application strictly requires both JSON Mode and Function Calling. It delivers solid throughput (483.7 t/s) but suffers from the highest latency in the benchmark (1.62s), making it unsuitable for real-time interfaces.
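In OpenAI-compatible chat completions APIs, JSON mode is requested via the `response_format` field. A minimal sketch of the request payload, assuming Nebius's endpoint follows the OpenAI schema (the model id and prompt are illustrative assumptions, not taken from Nebius's docs):

```python
def json_mode_request(model: str, system: str, user: str) -> dict:
    """Build kwargs for an OpenAI-compatible chat completion that
    forces the model to return a JSON object (JSON mode)."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "response_format": {"type": "json_object"},
    }

req = json_mode_request(
    "nvidia/nemotron-3-super-120b",  # illustrative model id
    'Reply with a JSON object: {"severity": ..., "summary": ...}',
    "Triage: database replica lag exceeds 30 minutes.",
)
# Pass to an OpenAI-compatible client, e.g.:
# client.chat.completions.create(**req)
```

Note that JSON mode guarantees syntactically valid JSON, not a particular schema, so the system prompt still has to describe the expected shape.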

Developer Alternative: Weights & Biases

Weights & Biases presents an unusual performance profile, likely acting as a specialized developer-environment endpoint rather than a production inference backend.

  • Output Speed: 144.9 t/s (significantly slower than the rest of the field — ~3.4x below the leader)
  • Latency (TTFT): 0.73s (#2 lowest)
  • Blended Price: $0.35 / 1M tokens
  • API Features: JSON Mode + Function Calling

Despite strong latency and full feature support, its throughput bottleneck (144.9 t/s) makes it unsuitable for production traffic. It is best suited for short-context developer testing and evaluation environments.

Frequently Asked Questions

Which API provider is cheapest for NVIDIA Nemotron 3 Super?

DeepInfra is the cheapest provider at $0.20 blended per 1M tokens — roughly 55% cheaper than Nebius and Lightning AI, and 50% cheaper than Baseten.

Which provider has the fastest Time to First Token (TTFT)?

Baseten offers the fastest latency with a TTFT of 0.56s, making it ideal for real-time conversational applications.

Does DeepInfra support Function Calling for Nemotron 3 Super?

Yes, DeepInfra supports Function Calling, making it suitable for agentic workflows. Lightning AI and Baseten currently do not support this feature.
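In OpenAI-compatible APIs, function calling is driven by a `tools` array of JSON Schema tool definitions. A sketch of one such definition, assuming DeepInfra's endpoint follows that schema (the `file_ticket` function and its fields are hypothetical, chosen to match the IT-ticket-automation use case mentioned earlier):

```python
# One entry for the "tools" array of an OpenAI-compatible
# chat.completions.create call; name and fields are illustrative.
ticket_tool = {
    "type": "function",
    "function": {
        "name": "file_ticket",
        "description": "Open an IT ticket for the described issue.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "priority": {"type": "string",
                             "enum": ["low", "medium", "high"]},
            },
            "required": ["title", "priority"],
        },
    },
}
# Sent alongside the messages, e.g.:
# client.chat.completions.create(model=..., messages=..., tools=[ticket_tool])
```

The model then returns a `tool_calls` entry with JSON arguments matching this schema, which your agent executes and feeds back as a `tool` message.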

Is Nebius worth the extra cost?

Nebius is worth the premium ($0.45/1M) only if your application strictly requires both JSON Mode and Function Calling with no tolerance for prompt-engineering workarounds.

What makes Nemotron 3 Super different from other reasoning models?

Nemotron 3 Super uses a unique hybrid Mamba2-Transformer LatentMoE architecture, enabling 120B total parameters with only 12B active per inference. This delivers over 5x throughput compared to the previous Nemotron Super, while supporting a native 1M-token context window for long-running autonomous agents.

Conclusion

For the vast majority of Nemotron 3 Super 120B deployments, DeepInfra is the recommended provider. It offers the market’s lowest price ($0.20/1M), strong throughput (459.3 t/s), viable latency (1.01s), and Function Calling support — all without the significant cost premium of the competition.

  • Choose DeepInfra for the best overall value — lowest cost, strong throughput, and function calling support.
  • Choose Baseten if your application is latency-critical and every millisecond of TTFT counts.
  • Choose Lightning AI for pure bulk text generation where speed is the sole metric and cost is not a constraint.
  • Choose Nebius if native JSON Mode and Function Calling are both non-negotiable requirements.