NVIDIA Nemotron 3 Super 120B API Benchmarks

Published on 2026.05.25 by DeepInfra

NVIDIA Nemotron 3 Super 120B A12B is available across multiple API providers, and the spread in performance and cost is wide enough to change deployment decisions. Artificial Analysis benchmarks three providers (Lightning AI, CoreWeave, and Nebius), with output speed ranging from 154 to 509 t/s (a 3.3x gap), TTFT spanning 0.98s to 1.94s, and blended pricing from $0.26 to $0.39 per 1M tokens. DeepInfra, covered separately below with figures from its own benchmarking, adds a fourth option worth evaluating, with lower blended cost and full function calling support. This breakdown covers what each provider delivers and which workloads each one fits.

Nemotron 3 Super 120B — API Review Summary

3 providers benchmarked by Artificial Analysis (May 2026): CoreWeave, Nebius, Lightning AI
Benchmarking workload: 10,000 input tokens (production-reflective default)
Fastest output speed: Lightning AI 509.3 t/s · Nebius 171.6 t/s · CoreWeave 154.4 t/s (3.3x spread)
Lowest TTFT: CoreWeave 0.98s · Lightning AI 1.15s · Nebius 1.94s
Lowest time to first answer token: Lightning AI 5.08s · Nebius 13.59s · CoreWeave 13.93s
Lowest blended price (7:2:1 cache:input:output): CoreWeave $0.26 · Nebius $0.36 · Lightning AI $0.39
All 3 providers support JSON mode; function calling supported by CoreWeave and Nebius (not Lightning AI)
Context windows: 256k–262k tokens depending on provider

Nemotron 3 Super — Best APIs

Provider	Best Fit	Blended ($/1M)	Input ($/1M)	Output ($/1M)	TTFT (s)	Speed (t/s)	Context (tokens)	JSON mode	Function calling
DeepInfra	Lowest blended cost; full function calling + JSON; private endpoints	0.20	0.10	0.50	1.01	459.3	262k	Yes	Yes
Lightning AI	Best raw throughput + fastest first answer token for reasoning-heavy UX	0.39	0.35	0.75	1.15	509.3	256k	Yes	No
CoreWeave	Lowest cost in the Artificial Analysis set; best TTFT; strong for cost-optimized high-volume workloads	0.26	0.20	0.80	0.98	154.4	262k	Yes	Yes
Nebius	Balanced option; function calling + competitive blended pricing	0.36	0.30	0.90	1.94	171.6	256k	Yes	Yes
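
The blended figures in the table can be checked against the per-token prices. The sketch below assumes that cached input is billed at the regular input rate, since no separate cache price is listed here; under that assumption the 7:2:1 cache:input:output weighting reduces to nine parts input and one part output. The function name `blended_price` and its parameters are illustrative, not part of any provider's API.

```python
# Per-1M-token prices in USD, taken from the table above.
PRICES = {
    "DeepInfra":    {"input": 0.10, "output": 0.50},
    "Lightning AI": {"input": 0.35, "output": 0.75},
    "CoreWeave":    {"input": 0.20, "output": 0.80},
    "Nebius":       {"input": 0.30, "output": 0.90},
}


def blended_price(input_price, output_price, cache_price=None, weights=(7, 2, 1)):
    """Weighted average price per 1M tokens for cache:input:output weights."""
    if cache_price is None:
        # Assumption: cached input is billed at the normal input rate.
        cache_price = input_price
    w_cache, w_in, w_out = weights
    total = w_cache * cache_price + w_in * input_price + w_out * output_price
    return total / (w_cache + w_in + w_out)


for name, p in PRICES.items():
    b721 = blended_price(p["input"], p["output"])                     # 7:2:1 blend
    b31 = blended_price(p["input"], p["output"], weights=(0, 3, 1))   # 3:1 input:output
    print(f"{name:<13} 7:2:1 = ${b721:.2f}   3:1 = ${b31:.2f}")
```

Under this assumption the 7:2:1 blend reproduces the three Artificial Analysis figures ($0.39, $0.26, $0.36). DeepInfra's listed $0.20 instead matches a 3:1 input:output blend, so the two sets of blended prices may not be computed on the same basis. DeepInfra remains the lowest either way: about $0.14 under 7:2:1, or $0.20 against CoreWeave's $0.35 under 3:1.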

About NVIDIA Nemotron 3 Super 120B A12B

Nemotron 3 Super 120B A12B is an open-weight reasoning model from NVIDIA, released March 2026. It uses a hybrid Mamba-Transformer Mixture-of-Experts architecture with 120B total parameters, of which 12B are active per token. Key architectural features: LatentMoE for improved routing accuracy, Multi-Token Prediction (MTP) layers enabling native speculative decoding without a separate draft model, and NVFP4 pre-training with a BF16 release checkpoint. The model supports a context length of up to 1M tokens (provider availability varies; see individual listings). It is designed for reasoning, tool use, agentic workflows, and long-context instruction following across English, code, and multiple languages. Weights are available under the NVIDIA Nemotron Open Model License, which permits commercial use.

For a full breakdown of the model’s architecture, benchmarks, and design decisions, see the Nemotron 3 Super release post. For context on how it sits within the broader Nemotron family, the Nemotron 3 Nano explainer covers the tradeoffs between model sizes.

Provider Analyses

Lightning AI — Best Raw Throughput and Fastest First Answer Token

Output Speed: 509.3 t/s (#1 fastest)
TTFT: 1.15s
Time to First Answer Token: 5.08s (#1 fastest)
Blended Price: $0.39/1M (highest in the set)
Input Price: $0.35/1M
Output Price: $0.75/1M
Context Window: 256k tokens
API Features: JSON mode — No function calling

Lightning AI leads the benchmark on both raw output speed (509.3 t/s) and time to first answer token (5.08s) — the latter being the more meaningful latency metric for reasoning models, where the model thinks before responding. For user-facing applications where perceived responsiveness depends on getting an actual answer quickly, that 5.08s figure is the number to focus on. The tradeoff is cost ($0.39/1M blended, highest in the set) and the absence of function calling, which limits it for tool-use and agentic pipelines.

CoreWeave — Best Cost and Lowest TTFT

Output Speed: 154.4 t/s
TTFT: 0.98s (#1 lowest)
Time to First Answer Token: 13.93s
Blended Price: $0.26/1M (#1 lowest)
Input Price: $0.20/1M
Output Price: $0.80/1M
Context Window: 262k tokens
API Features: JSON mode, Function calling

CoreWeave is the cost leader in the Artificial Analysis benchmark set at $0.26/1M blended, with the lowest input price of the three ($0.20/1M) and the fastest TTFT at 0.98s. Its time to first answer token (13.93s) is the highest of the three, meaning users wait longest for reasoning to finish and the answer to begin, which matters for interactive applications. For cost-optimized high-volume workloads where output verbosity is controlled and TTFT matters more than answer latency, CoreWeave is the strongest option in this benchmark set.

Nebius — Balanced Option with Function Calling

Output Speed: 171.6 t/s
TTFT: 1.94s
Time to First Answer Token: 13.59s
Blended Price: $0.36/1M
Input Price: $0.30/1M
Output Price: $0.90/1M
Context Window: 256k tokens
API Features: JSON mode, Function calling

Nebius sits between Lightning AI and CoreWeave on most metrics: mid-pack output speed (171.6 t/s), a mid-range time to first answer token (13.59s, the lowest among the benchmarked providers that support function calling), and a $0.36/1M blended price. Its TTFT (1.94s) is the highest of the three. It is the clearest middle-ground option for teams that need function calling and want somewhat higher output speed than CoreWeave (171.6 vs. 154.4 t/s). Output pricing ($0.90/1M) is the highest in the benchmark set, which matters for verbose workloads.

DeepInfra — Lowest Blended Cost with Full Feature Support

DeepInfra is not included in this Artificial Analysis benchmark snapshot, but figures published from DeepInfra’s own benchmarking place it as a strong fourth option for production Nemotron 3 Super deployments. The figures below come from DeepInfra’s platform data rather than the same AA crawl used for the three providers above, so they are not directly comparable measurement for measurement.

Output Speed: 459.3 t/s
TTFT: 1.01s
Blended Price: $0.20/1M (lowest across all four providers)
Input Price: $0.10/1M
Output Price: $0.50/1M
Context Window: 262k tokens
API Features: JSON mode, Function calling, private endpoint deployment

On DeepInfra’s own figures, it delivers the lowest blended price of the four providers evaluated here, with output speed competitive with Lightning AI (459.3 vs. 509.3 t/s), a TTFT of roughly one second (1.01s), and full function calling support. It is the only provider in this comparison to combine all three. Private endpoint deployment is also available for teams with data isolation requirements. For a detailed cost breakdown across real workload patterns, see the Nemotron 3 Super pricing guide. To get started, visit the Nemotron 3 Super model page on DeepInfra.

Technical Deep-Dive: What Developers Need to Know

1. TTFT vs. Time to First Answer Token

Nemotron 3 Super is a reasoning model — it thinks before it responds. This creates two distinct latency metrics that developers often conflate:

Time to First Token (TTFT): time from sending a request to receiving the first token back, which for a reasoning model is typically the start of the thinking process, not the start of the actual answer.
Time to First Answer Token: time until the model finishes reasoning and begins generating the final response. For user-facing applications, this is the number that determines perceived latency.
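
The two metrics can be separated when consuming a streamed response. The sketch below is a minimal illustration that runs on a simulated stream rather than a real API: the `Chunk` type and its `reasoning` and `answer` fields are hypothetical, and real providers differ in how they label thinking text versus final-answer text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class Chunk:
    """Hypothetical streamed chunk; real APIs mark reasoning and answer text differently."""
    arrival_s: float     # seconds elapsed since the request was sent
    reasoning: str = ""  # thinking text carried by this chunk, if any
    answer: str = ""     # final-answer text carried by this chunk, if any


def latency_metrics(chunks: Iterable[Chunk]) -> Tuple[Optional[float], Optional[float]]:
    """Return (TTFT, time to first answer token); a value is None if never observed."""
    ttft = None
    first_answer = None
    for chunk in chunks:
        # TTFT: the first chunk carrying any token at all, thinking or answer.
        if ttft is None and (chunk.reasoning or chunk.answer):
            ttft = chunk.arrival_s
        # Time to first answer token: the first chunk carrying answer text.
        if chunk.answer:
            first_answer = chunk.arrival_s
            break
    return ttft, first_answer


# Simulated stream: an empty keep-alive chunk, two thinking chunks, then the answer.
stream = [
    Chunk(0.4),
    Chunk(1.0, reasoning="Let me think"),
    Chunk(3.5, reasoning=" about this."),
    Chunk(5.0, answer="The"),
    Chunk(5.1, answer=" answer"),
]
ttft, first_answer = latency_metrics(stream)
print(f"TTFT: {ttft}s, time to first answer token: {first_answer}s")
```

In this simulated case the first token arrives at 1.0s but the answer only starts at 5.0s, which is the gap a user would actually perceive.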

Lightning AI’s 5.08s time to first answer token is about 8.5 seconds shorter than Nebius’s 13.59s and about 8.9 seconds shorter than CoreWeave’s 13.93s, a gap users will notice in interactive applications. For batch or async workloads where answer latency is less visible, TTFT and cost become the more relevant metrics.
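
As a rough worked example, time to first answer token and output speed can be combined into an estimate of when a complete answer is available. This is a sketch under two assumptions: the measured output speed holds while the answer is generated, and the answer length of 500 tokens is a hypothetical value chosen for illustration. DeepInfra is omitted because no time to first answer token is listed for it.

```python
# Artificial Analysis figures quoted above:
# (time to first answer token in seconds, output speed in tokens/second).
MEASURED = {
    "Lightning AI": (5.08, 509.3),
    "CoreWeave": (13.93, 154.4),
    "Nebius": (13.59, 171.6),
}


def estimated_completion_s(first_answer_s, speed_tps, answer_tokens):
    """Estimated seconds until the full answer has streamed.

    Assumes the measured output speed applies for the whole answer.
    """
    return first_answer_s + answer_tokens / speed_tps


ANSWER_TOKENS = 500  # hypothetical answer length

estimates = {
    name: estimated_completion_s(first_answer_s, speed_tps, ANSWER_TOKENS)
    for name, (first_answer_s, speed_tps) in MEASURED.items()
}
for name, seconds in sorted(estimates.items(), key=lambda item: item[1]):
    print(f"{name:<13} ~{seconds:.1f}s")
```

For a 500-token answer this gives roughly 6.1s for Lightning AI, 16.5s for Nebius, and 17.2s for CoreWeave, so the first-answer gap dominates and the speed difference adds another two seconds or so.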

2. Function Calling and Agentic Pipelines

Three of the four providers in this comparison support function calling: CoreWeave, Nebius, and DeepInfra. Lightning AI does not. For agentic workflows where the model invokes external tools, retrieves data, or returns structured action objects, function calling is a hard requirement, not a preference. Lightning AI’s throughput advantage becomes irrelevant if your application depends on tool use. All four providers support JSON mode, so structured output generation is broadly available regardless of provider.
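
Treating function calling as a hard filter before ranking on anything else can be sketched as follows. The `Provider` type and the `candidates` helper are illustrative names; the values are the ones listed in this article, with DeepInfra's being self-reported rather than measured by Artificial Analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    blended_usd: float       # blended price per 1M tokens
    speed_tps: float         # output speed in tokens/second
    function_calling: bool


# Values as listed in this article; DeepInfra's are self-reported.
PROVIDERS = [
    Provider("DeepInfra", 0.20, 459.3, True),
    Provider("Lightning AI", 0.39, 509.3, False),
    Provider("CoreWeave", 0.26, 154.4, True),
    Provider("Nebius", 0.36, 171.6, True),
]


def candidates(providers, need_function_calling, rank_by):
    """Drop providers that miss a hard requirement, then rank the rest."""
    eligible = [p for p in providers if p.function_calling or not need_function_calling]
    return sorted(eligible, key=rank_by)


def fastest_first(p):
    return -p.speed_tps


with_tools = [p.name for p in candidates(PROVIDERS, True, fastest_first)]
without_tools = [p.name for p in candidates(PROVIDERS, False, fastest_first)]
print("Tool use required:    ", with_tools)
print("Tool use not required:", without_tools)
```

With tool use required, Lightning AI drops out despite being the fastest, and the speed ranking becomes DeepInfra, Nebius, CoreWeave.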

3. Output Verbosity and Cost

Artificial Analysis has noted that Nemotron 3 Super generates unusually high output-token volumes relative to similar open-weight models. With output pricing ranging from $0.50 to $0.90/1M across providers, unconstrained generation length can materially affect your bill. Setting explicit output caps, using structured outputs to reduce rambling, and measuring average output token counts before committing to provider pricing are all worth doing before scaling. For guidance on how to model these costs, see the token math and cost-per-completion guide.
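
To see how output length moves the bill, the sketch below prices a single request at several output lengths. The 10,000 input tokens match the benchmark workload; the output lengths are hypothetical, and the calculation assumes reasoning tokens are billed as output tokens and ignores any cache discount.

```python
# Per-1M-token prices in USD (input, output), from the provider listings above.
PRICES = {
    "DeepInfra": (0.10, 0.50),
    "Lightning AI": (0.35, 0.75),
    "CoreWeave": (0.20, 0.80),
    "Nebius": (0.30, 0.90),
}


def request_cost_usd(input_tokens, output_tokens, input_price, output_price):
    """Cost of one request in USD, given per-1M-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


INPUT_TOKENS = 10_000                    # matches the benchmark workload
OUTPUT_SCENARIOS = (500, 2_000, 8_000)   # hypothetical lengths, reasoning included

for name, (input_price, output_price) in PRICES.items():
    costs = [
        request_cost_usd(INPUT_TOKENS, n, input_price, output_price)
        for n in OUTPUT_SCENARIOS
    ]
    per_thousand = ", ".join(f"${c * 1000:.2f}" for c in costs)
    print(f"{name:<13} {per_thousand} per 1,000 requests")
```

On CoreWeave's prices, for example, these lengths come to about $2.40, $3.60, and $8.40 per 1,000 requests. At 8,000 output tokens, output accounts for roughly three quarters of the request cost even though the input is larger, which is why an output cap is worth setting early.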

4. Context Window Availability

The model supports up to 1M tokens at the architectural level, but no provider in this benchmark set exposes that full window — context windows range from 256k to 262k depending on provider. For workloads requiring full 1M context, verify availability directly with the provider before building around it.
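
A quick pre-flight check along these lines can flag requests that will not fit a given provider. This is a sketch with two assumptions: "256k" and "262k" are read as round thousands (exact limits may differ), and the window is taken to cover the prompt plus all generated tokens, reasoning included.

```python
# Context windows as listed above. Reading "256k" and "262k" as round
# thousands is an assumption; check each provider's exact limit.
CONTEXT_WINDOW = {
    "DeepInfra": 262_000,
    "Lightning AI": 256_000,
    "CoreWeave": 262_000,
    "Nebius": 256_000,
}


def providers_that_fit(prompt_tokens, reserved_output_tokens):
    """Providers whose window holds the prompt plus the reserved output budget."""
    needed = prompt_tokens + reserved_output_tokens
    return [name for name, window in CONTEXT_WINDOW.items() if needed <= window]


# Hypothetical request: a 250,000-token prompt with 8,000 tokens reserved for output.
print(providers_that_fit(250_000, 8_000))
```

Under these assumptions the hypothetical 258,000-token request fits only the two 262k listings, and nothing near 1M tokens fits any of the four.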

Conclusion

The right provider for Nemotron 3 Super depends on what your workload prioritizes. Lightning AI leads on raw throughput and first-answer latency but lacks function calling and carries the highest blended price. CoreWeave is the cost leader in the AA benchmark set with the fastest TTFT, but its first-answer latency (13.93s) is the highest of the group. Nebius is the middle-ground option for teams that need function calling and competitive blended pricing with reasonable throughput. DeepInfra, on its own separately benchmarked figures, adds the lowest blended cost of the four, a TTFT of roughly one second (1.01s), and full function calling in one package.

For a full cost comparison across workload types, see the Nemotron 3 Super pricing guide. To evaluate the smaller member of the family for cost-efficient agentic workloads, the Nemotron 3 Nano 30B benchmarks are a useful reference. To get started with Nemotron 3 Super on DeepInfra, visit the model page.
