We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Nemotron 3 Super Provider Pricing Comparison (2026)
Published on 2026.05.25 by DeepInfra
Nemotron 3 Super Provider Pricing Comparison (2026)

Nemotron 3 Super is available from multiple providers, and the price spread is real: OpenRouter lists $0.09/$0.45 per 1M input/output tokens, DeepInfra lists $0.10/$0.50, and the Artificial Analysis median across all providers sits at $0.30/$0.75. The right provider depends on what your workload actually looks like — context requirements, output verbosity, and whether you need production features like JSON mode, function calling, or private endpoints.

The model is NVIDIA Nemotron 3 Super 120B A12B, released March 11, 2026 under the NVIDIA Nemotron Open Model License (commercial use permitted). It is an open-weight reasoning model with roughly 120B total parameters and 12B active per inference pass, built for reasoning, tool use, and instruction following. The full release announcement and the NVIDIA Nemotron API pricing guide cover the broader family context.

Nemotron 3 Super Executive Summary

Best for teams that want an open-weight NVIDIA reasoning model without immediately taking on self-hosting complexity. OpenRouter is the lowest published-price option at $0.09/$0.45 per 1M tokens. DeepInfra is close at $0.10/$0.50 and adds public and private endpoint options plus JSON and function calling support. If you care most about production-friendly integration, DeepInfra is the more balanced pick; if raw token cost is the only variable, OpenRouter leads.

Best ForProviderWhy
Managed production deployments with structured outputsDeepInfraPublic and private endpoints, JSON and function calling, $0.10/$0.50 per 1M tokens.
Lowest price / cost-sensitive workloadsOpenRouterLowest listed price at $0.09/$0.45 per 1M tokens.
Easiest onboarding / fastest time-to-first-callOpenRouterMulti-provider routing, free trial endpoint for non-production testing.
RAG, document-heavy, or high-throughput use casesDeepInfraPrompt caching support combined with JSON and function calling for production retrieval and agent pipelines.
Long-context experiments and agent workflowsOpenRouterLists a 1M-token context window for multi-agent applications.
Security-conscious teams that still want hosted inferenceDeepInfraSOC 2 and ISO 27001 certified; private endpoint deployment available.

Understanding Tokens and How You’re Charged

Token pricing is where Nemotron 3 Super gets deceptively simple. The list prices look low. The bill depends on how much text you send, how much the model sends back, and whether your provider discounts repeated prompt prefixes. That matters more here because Artificial Analysis describes Nemotron 3 Super as unusually verbose relative to similar open-weight models — which means output token spend compounds faster than the sticker price suggests. For a deeper primer on how token economics work in practice, see the token math and cost-per-completion guide.

Token typeWhat it isWhy it matters
Input tokensTokens you send in the prompt, system message, tool results, chat history, and retrieved documentsBaseline request cost. In RAG and agent systems, input grows quietly over time as every extra document chunk and prior turn gets billed again.
Output tokensTokens the model generates backThe expensive side for Nemotron 3 Super — output pricing is 5x input, and the model tends to generate a lot.
Cached input tokensReused prompt tokens some providers bill at a lower rate when the same prefix is sent repeatedlyCan materially reduce cost for repeated system prompts, long boilerplate context, or stable document prefixes.
Context window tokensTotal tokens the model can consider in one request, including both input and generated outputBigger context is useful, but it also makes it easy to accidentally send massive prompts and pay for them.

The output-to-input ratio is the key number

On both OpenRouter and DeepInfra, output tokens cost 5x input tokens. On the Artificial Analysis median, output is 2.5x input. A quick mental model:

  • Short prompt + short answer = very cheap
  • Long prompt + short answer = usually still manageable
  • Short prompt + long answer = where Nemotron 3 Super starts costing more than the sticker price suggests
  • Long prompt + long answer = budget review meeting

If you only remember one thing about Nemotron 3 Super pricing: choose the provider on per-token price, choose the app design on output control. The second decision usually saves more money than the first.

Provider Token Cost Tradeoffs for Nemotron 3 Super

For a broader look at how prices compare across the full Nemotron lineup, the NVIDIA Nemotron API pricing guide covers the family-wide cost picture.

ProviderInput /1MOutput /1MAdvantagesDisadvantages
OpenRouter$0.09$0.45Lowest listed price in this research. Good for cost-sensitive workloads if prompts and output length are disciplined. Lists 1M-token context window.Output is still 5x input. Free tier logs prompts/outputs — not suitable for sensitive production workloads.
DeepInfra$0.10$0.50Nearly as cheap as OpenRouter. Adds prompt caching (Prompt Cache Key), JSON and function calling, private endpoints, SOC 2 / ISO 27001.Slightly higher on both sides. Published context window is 262,144 tokens on the hosted page, not 1M.
Artificial Analysis median$0.30$0.75Useful as a reality check for what this model often costs outside the cheapest endpoints. Cache-hit median of $0.20/1M.Much more expensive than OpenRouter and DeepInfra. Choosing a random provider instead of the cheapest can mean 3x higher input costs.

Where each provider tends to make sense

  • OpenRouter: best when you want the lowest published per-token rate. Works well for chat, batch inference, or agent systems where output length is aggressively capped.
  • DeepInfra: better for repeated-prefix workloads where prompt caching reduces input spend; better for production pipelines that benefit from JSON or function calling, which can keep the model from generating more tokens than your application needs.
  • Other providers near the median: hard to justify on cost alone. A 3x jump on input and materially higher output means paying noticeably more for the same traffic.

Two cost traps especially relevant for Nemotron 3 Super

  • Verbose output: Artificial Analysis observed unusually high output-token generation relative to similar models. If you let the model think out loud or return oversized answers, output billing becomes the main driver.
  • Long context complacency: a big context window is useful. It is also how teams accidentally build an efficient way to pay for irrelevant tokens. “Fits in context” and “is economical” are different questions.

Practical ways to keep token spend under control

  • Cap max_tokens or max_new_tokens aggressively
  • Ask for concise answers unless detail is required
  • Use structured outputs when possible
  • Trim chat history instead of replaying full transcripts
  • Deduplicate RAG chunks before sending them
  • Cache stable prompt prefixes where the provider supports it
  • Measure output-token averages before committing to provider pricing based on input alone

DeepInfra: the Power User’s Choice for Nemotron 3 Super

DeepInfra is built on bare-metal infrastructure, which translates into lower latency, better hardware utilization, and more predictable costs at scale — typically 50–80% cheaper than major cloud competitors. For Nemotron 3 Super specifically, it hits a useful middle ground: low pricing, production-friendly deployment options, and features that reduce token waste in real workloads. For a look at how DeepInfra compares against other open-weight reasoning model options, see the open vs. closed source model guide.

ModelBest Use CaseContext WindowInput ($/1M)Output ($/1M)
NVIDIA Nemotron 3 Super 120B A12BReasoning, tool use, and multi-agent production workloads262,144 tokens$0.10$0.50

DeepInfra prices Nemotron 3 Super at $0.10/1M input and $0.50/1M output — only slightly above the cheapest listed endpoint, while adding private endpoint deployment, JSON responses, function calling, and prompt caching support. Teams evaluating smaller-footprint alternatives in the same family can also look at the Nemotron 3 Nano 30B A3B for cost-efficient agentic workloads.

Real-World Cost Scenarios for Developers

Scenario 1: Structured support copilot with concise JSON replies

An internal support assistant that reads ticket context, classifies the issue, and returns a short structured payload for downstream automation. DeepInfra’s JSON support keeps outputs tight and machine-readable rather than paying for long free-form responses.

MetricValue
Volume1,000,000 requests/month
ModelNVIDIA Nemotron 3 Super 120B A12B
ProviderDeepInfra
Input Tokens200,000,000
Output Tokens50,000,000
Monthly Cost$45.00

Cost breakdown:

  • Input: 200M × $0.10/1M = $20.00
  • Output: 50M × $0.50/1M = $25.00
  • Total: $45.00/month

Comparison: The same workload at the Artificial Analysis median price would cost $97.50/month — $52.50 more.

Scenario 2: RAG app with repeated system prompts and stable document prefixes

A production RAG service that reuses the same system instructions, formatting rules, and retrieval scaffolding across large numbers of calls. DeepInfra’s Prompt Cache Key support is directly useful here for repeated-prefix workloads. For background on how RAG cost patterns behave at scale, the LLM API provider performance KPIs guide covers the relevant metrics.

MetricValue
Volume500,000 requests/month
ModelNVIDIA Nemotron 3 Super 120B A12B
ProviderDeepInfra
Input Tokens1,000,000,000
Output Tokens100,000,000
Monthly Cost$150.00

Cost breakdown:

  • Input: 1B × $0.10/1M = $100.00
  • Output: 100M × $0.50/1M = $50.00
  • Total: $150.00/month

Comparison: The same workload at the Artificial Analysis median price would cost $375.00/month — $225.00 more.

Scenario 3: Tool-calling agent backend for engineering workflows

A tool-using agent that inspects logs, calls internal APIs, and returns action objects to your application. DeepInfra’s function calling support pushes the model toward short tool arguments and structured responses instead of paying for rambling prose. For teams weighing whether a smaller model might handle some of these agent tasks, the Nemotron 3 Nano vs GPT-OSS-20B comparison is a useful reference.

MetricValue
Volume200,000 agent runs/month
ModelNVIDIA Nemotron 3 Super 120B A12B
ProviderDeepInfra
Input Tokens600,000,000
Output Tokens120,000,000
Monthly Cost$120.00

Cost breakdown:

  • Input: 600M × $0.10/1M = $60.00
  • Output: 120M × $0.50/1M = $60.00
  • Total: $120.00/month

Comparison: The same workload at the Artificial Analysis median price would cost $270.00/month — $150.00 more.

Scenario 4: Private-endpoint deployment for a security-conscious internal app

Teams that want hosted inference but still need a more controlled deployment path — internal copilots, document QA, or operational workflows. DeepInfra offers private endpoint deployment while keeping Nemotron 3 Super pricing close to the cheapest option in this dataset.

MetricValue
Volume5,000,000 requests/month
ModelNVIDIA Nemotron 3 Super 120B A12B
ProviderDeepInfra
Input Tokens2,500,000,000
Output Tokens250,000,000
Monthly Cost$375.00

Cost breakdown:

  • Input: 2.5B × $0.10/1M = $250.00
  • Output: 250M × $0.50/1M = $125.00
  • Total: $375.00/month

Comparison: The same workload at the Artificial Analysis median price would cost $937.50/month — $562.50 more.

Scenario 5: High-volume summarization pipeline with controlled output caps

Batch summarization of documents, tickets, transcripts, or knowledge-base updates. Nemotron 3 Super works well here only if generations are kept short. For teams considering routing high-volume batch work to a smaller, faster reasoner, the Nemotron 3 Nano explainer covers where the smaller sibling makes economic sense.

MetricValue
Volume2,000,000 summaries/month
ModelNVIDIA Nemotron 3 Super 120B A12B
ProviderDeepInfra
Input Tokens800,000,000
Output Tokens80,000,000
Monthly Cost$120.00

Cost breakdown:

  • Input: 800M × $0.10/1M = $80.00
  • Output: 80M × $0.50/1M = $40.00
  • Total: $120.00/month

Comparison: The same workload at the Artificial Analysis median price would cost $300.00/month — $180.00 more.

Conclusion

Choosing a provider for Nemotron 3 Super is less about the model and more about where your production requirements land. The model is capable and open-weight, but the economics vary enough that the wrong default choice can cost two to three times more for identical traffic.

The criteria that matter most are token pricing, output verbosity, and the features that help you control both. Nemotron 3 Super generates unusually long outputs relative to similar open-weight models, which means the 5x output-to-input price ratio compounds quickly on workloads that don’t actively constrain generation length. Prompt caching becomes meaningful when system prompts and retrieval scaffolding are stable across thousands of calls. JSON and function calling are practical ways to keep the model from generating more tokens than your application actually needs. The API benchmarks for Nemotron 3 Super give you a clearer picture of how throughput and latency hold up under real conditions.

If raw per-token cost is the only variable, OpenRouter has the lowest published price in this dataset. If you’re building something for production — with repeated prefixes, structured outputs, or deployment constraints — DeepInfra‘s combination of competitive pricing, prompt caching, and private endpoint options is the more complete package. The DeepInfra model catalog makes it straightforward to compare adjacent options when you’re wiring this into an agent loop, a RAG pipeline, or a batch processing job.

Related articles
How to Use OpenClaw with DeepInfra: Setup & Workflow GuideHow to Use OpenClaw with DeepInfra: Setup & Workflow Guide<p>When you first learn how to use OpenClaw, the onboarding flow asks for an API key and points you toward Anthropic or OpenAI. Reasonable starting point. For production agents running dozens of tasks a day, it&#8217;s an expensive one. OpenClaw works with any OpenAI-compatible API, so you can swap the default model for an open-weight [&hellip;]</p>
Seed Anchoring and Parameter Tweaking with SDXL Turbo: Create Stunning Cubist ArtSeed Anchoring and Parameter Tweaking with SDXL Turbo: Create Stunning Cubist ArtIn this blog post, we're going to explore how to create stunning cubist art using SDXL Turbo using some advanced image generation techniques.
Kimi K2.5 API Benchmarks: Latency, Throughput & CostKimi K2.5 API Benchmarks: Latency, Throughput & Cost<p>About Kimi K2.5 Kimi K2.5 is Moonshot AI&#8217;s flagship open-source reasoning model, released in January 2026. It is a native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens. The model features a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion activated parameters. Kimi K2.5 [&hellip;]</p>