DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Nemotron 3 Super is available from multiple providers, and the price spread is real: OpenRouter lists $0.09/$0.45 per 1M input/output tokens, DeepInfra lists $0.10/$0.50, and the Artificial Analysis median across all providers sits at $0.30/$0.75. The right provider depends on what your workload actually looks like — context requirements, output verbosity, and whether you need production features like JSON mode, function calling, or private endpoints.
The model is NVIDIA Nemotron 3 Super 120B A12B, released March 11, 2026 under the NVIDIA Nemotron Open Model License (commercial use permitted). It is an open-weight reasoning model with roughly 120B total parameters and 12B active per inference pass, built for reasoning, tool use, and instruction following. The full release announcement and the NVIDIA Nemotron API pricing guide cover the broader family context.
Best for teams that want an open-weight NVIDIA reasoning model without immediately taking on self-hosting complexity. OpenRouter is the lowest published-price option at $0.09/$0.45 per 1M tokens. DeepInfra is close at $0.10/$0.50 and adds public and private endpoint options plus JSON and function calling support. If you care most about production-friendly integration, DeepInfra is the more balanced pick; if raw token cost is the only variable, OpenRouter leads.
| Best For | Provider | Why |
|---|---|---|
| Managed production deployments with structured outputs | DeepInfra | Public and private endpoints, JSON and function calling, $0.10/$0.50 per 1M tokens. |
| Lowest price / cost-sensitive workloads | OpenRouter | Lowest listed price at $0.09/$0.45 per 1M tokens. |
| Easiest onboarding / fastest time-to-first-call | OpenRouter | Multi-provider routing, free trial endpoint for non-production testing. |
| RAG, document-heavy, or high-throughput use cases | DeepInfra | Prompt caching support combined with JSON and function calling for production retrieval and agent pipelines. |
| Long-context experiments and agent workflows | OpenRouter | Lists a 1M-token context window for multi-agent applications. |
| Security-conscious teams that still want hosted inference | DeepInfra | SOC 2 and ISO 27001 certified; private endpoint deployment available. |
Token pricing is where Nemotron 3 Super gets deceptively simple. The list prices look low. The bill depends on how much text you send, how much the model sends back, and whether your provider discounts repeated prompt prefixes. That matters more here because Artificial Analysis describes Nemotron 3 Super as unusually verbose relative to similar open-weight models — which means output token spend compounds faster than the sticker price suggests. For a deeper primer on how token economics work in practice, see the token math and cost-per-completion guide.
| Token type | What it is | Why it matters |
|---|---|---|
| Input tokens | Tokens you send in the prompt, system message, tool results, chat history, and retrieved documents | Baseline request cost. In RAG and agent systems, input grows quietly over time as every extra document chunk and prior turn gets billed again. |
| Output tokens | Tokens the model generates back | The expensive side for Nemotron 3 Super — output pricing is 5x input, and the model tends to generate a lot. |
| Cached input tokens | Reused prompt tokens some providers bill at a lower rate when the same prefix is sent repeatedly | Can materially reduce cost for repeated system prompts, long boilerplate context, or stable document prefixes. |
| Context window tokens | Total tokens the model can consider in one request, including both input and generated output | Bigger context is useful, but it also makes it easy to accidentally send massive prompts and pay for them. |
The output-to-input ratio is the key number
On both OpenRouter and DeepInfra, output tokens cost 5x input tokens. On the Artificial Analysis median, output is 2.5x input. A quick mental model:
If you only remember one thing about Nemotron 3 Super pricing: choose the provider on per-token price, choose the app design on output control. The second decision usually saves more money than the first.
For a broader look at how prices compare across the full Nemotron lineup, the NVIDIA Nemotron API pricing guide covers the family-wide cost picture.
| Provider | Input /1M | Output /1M | Advantages | Disadvantages |
|---|---|---|---|---|
| OpenRouter | $0.09 | $0.45 | Lowest listed price in this research. Good for cost-sensitive workloads if prompts and output length are disciplined. Lists 1M-token context window. | Output is still 5x input. Free tier logs prompts/outputs — not suitable for sensitive production workloads. |
| DeepInfra | $0.10 | $0.50 | Nearly as cheap as OpenRouter. Adds prompt caching (Prompt Cache Key), JSON and function calling, private endpoints, SOC 2 / ISO 27001. | Slightly higher on both sides. Published context window is 262,144 tokens on the hosted page, not 1M. |
| Artificial Analysis median | $0.30 | $0.75 | Useful as a reality check for what this model often costs outside the cheapest endpoints. Cache-hit median of $0.20/1M. | Much more expensive than OpenRouter and DeepInfra. Choosing a random provider instead of the cheapest can mean 3x higher input costs. |
Where each provider tends to make sense
Two cost traps especially relevant for Nemotron 3 Super
Practical ways to keep token spend under control
DeepInfra is built on bare-metal infrastructure, which translates into lower latency, better hardware utilization, and more predictable costs at scale — typically 50–80% cheaper than major cloud competitors. For Nemotron 3 Super specifically, it hits a useful middle ground: low pricing, production-friendly deployment options, and features that reduce token waste in real workloads. For a look at how DeepInfra compares against other open-weight reasoning model options, see the open vs. closed source model guide.
| Model | Best Use Case | Context Window | Input ($/1M) | Output ($/1M) |
|---|---|---|---|---|
| NVIDIA Nemotron 3 Super 120B A12B | Reasoning, tool use, and multi-agent production workloads | 262,144 tokens | $0.10 | $0.50 |
DeepInfra prices Nemotron 3 Super at $0.10/1M input and $0.50/1M output — only slightly above the cheapest listed endpoint, while adding private endpoint deployment, JSON responses, function calling, and prompt caching support. Teams evaluating smaller-footprint alternatives in the same family can also look at the Nemotron 3 Nano 30B A3B for cost-efficient agentic workloads.
Scenario 1: Structured support copilot with concise JSON replies
An internal support assistant that reads ticket context, classifies the issue, and returns a short structured payload for downstream automation. DeepInfra’s JSON support keeps outputs tight and machine-readable rather than paying for long free-form responses.
| Metric | Value |
|---|---|
| Volume | 1,000,000 requests/month |
| Model | NVIDIA Nemotron 3 Super 120B A12B |
| Provider | DeepInfra |
| Input Tokens | 200,000,000 |
| Output Tokens | 50,000,000 |
| Monthly Cost | $45.00 |
Cost breakdown:
Comparison: The same workload at the Artificial Analysis median price would cost $97.50/month — $52.50 more.
Scenario 2: RAG app with repeated system prompts and stable document prefixes
A production RAG service that reuses the same system instructions, formatting rules, and retrieval scaffolding across large numbers of calls. DeepInfra’s Prompt Cache Key support is directly useful here for repeated-prefix workloads. For background on how RAG cost patterns behave at scale, the LLM API provider performance KPIs guide covers the relevant metrics.
| Metric | Value |
|---|---|
| Volume | 500,000 requests/month |
| Model | NVIDIA Nemotron 3 Super 120B A12B |
| Provider | DeepInfra |
| Input Tokens | 1,000,000,000 |
| Output Tokens | 100,000,000 |
| Monthly Cost | $150.00 |
Cost breakdown:
Comparison: The same workload at the Artificial Analysis median price would cost $375.00/month — $225.00 more.
Scenario 3: Tool-calling agent backend for engineering workflows
A tool-using agent that inspects logs, calls internal APIs, and returns action objects to your application. DeepInfra’s function calling support pushes the model toward short tool arguments and structured responses instead of paying for rambling prose. For teams weighing whether a smaller model might handle some of these agent tasks, the Nemotron 3 Nano vs GPT-OSS-20B comparison is a useful reference.
| Metric | Value |
|---|---|
| Volume | 200,000 agent runs/month |
| Model | NVIDIA Nemotron 3 Super 120B A12B |
| Provider | DeepInfra |
| Input Tokens | 600,000,000 |
| Output Tokens | 120,000,000 |
| Monthly Cost | $120.00 |
Cost breakdown:
Comparison: The same workload at the Artificial Analysis median price would cost $270.00/month — $150.00 more.
Scenario 4: Private-endpoint deployment for a security-conscious internal app
Teams that want hosted inference but still need a more controlled deployment path — internal copilots, document QA, or operational workflows. DeepInfra offers private endpoint deployment while keeping Nemotron 3 Super pricing close to the cheapest option in this dataset.
| Metric | Value |
|---|---|
| Volume | 5,000,000 requests/month |
| Model | NVIDIA Nemotron 3 Super 120B A12B |
| Provider | DeepInfra |
| Input Tokens | 2,500,000,000 |
| Output Tokens | 250,000,000 |
| Monthly Cost | $375.00 |
Cost breakdown:
Comparison: The same workload at the Artificial Analysis median price would cost $937.50/month — $562.50 more.
Scenario 5: High-volume summarization pipeline with controlled output caps
Batch summarization of documents, tickets, transcripts, or knowledge-base updates. Nemotron 3 Super works well here only if generations are kept short. For teams considering routing high-volume batch work to a smaller, faster reasoner, the Nemotron 3 Nano explainer covers where the smaller sibling makes economic sense.
| Metric | Value |
|---|---|
| Volume | 2,000,000 summaries/month |
| Model | NVIDIA Nemotron 3 Super 120B A12B |
| Provider | DeepInfra |
| Input Tokens | 800,000,000 |
| Output Tokens | 80,000,000 |
| Monthly Cost | $120.00 |
Cost breakdown:
Comparison: The same workload at the Artificial Analysis median price would cost $300.00/month — $180.00 more.
Choosing a provider for Nemotron 3 Super is less about the model and more about where your production requirements land. The model is capable and open-weight, but the economics vary enough that the wrong default choice can cost two to three times more for identical traffic.
The criteria that matter most are token pricing, output verbosity, and the features that help you control both. Nemotron 3 Super generates unusually long outputs relative to similar open-weight models, which means the 5x output-to-input price ratio compounds quickly on workloads that don’t actively constrain generation length. Prompt caching becomes meaningful when system prompts and retrieval scaffolding are stable across thousands of calls. JSON and function calling are practical ways to keep the model from generating more tokens than your application actually needs. The API benchmarks for Nemotron 3 Super give you a clearer picture of how throughput and latency hold up under real conditions.
If raw per-token cost is the only variable, OpenRouter has the lowest published price in this dataset. If you’re building something for production — with repeated prefixes, structured outputs, or deployment constraints — DeepInfra‘s combination of competitive pricing, prompt caching, and private endpoint options is the more complete package. The DeepInfra model catalog makes it straightforward to compare adjacent options when you’re wiring this into an agent loop, a RAG pipeline, or a batch processing job.
How to Use OpenClaw with DeepInfra: Setup & Workflow Guide<p>When you first learn how to use OpenClaw, the onboarding flow asks for an API key and points you toward Anthropic or OpenAI. Reasonable starting point. For production agents running dozens of tasks a day, it’s an expensive one. OpenClaw works with any OpenAI-compatible API, so you can swap the default model for an open-weight […]</p>
Seed Anchoring and Parameter Tweaking with SDXL Turbo: Create Stunning Cubist ArtIn this blog post, we're going to explore how to create stunning cubist art using SDXL Turbo using some advanced image generation techniques.
Kimi K2.5 API Benchmarks: Latency, Throughput & Cost<p>About Kimi K2.5 Kimi K2.5 is Moonshot AI’s flagship open-source reasoning model, released in January 2026. It is a native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens. The model features a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion activated parameters. Kimi K2.5 […]</p>
© 2026 DeepInfra. All rights reserved.