
DeepSeek V4 Pro matters because it pushes two levers developers actually care about at the same time: open-weight availability and a very competitive provider market. In the research compiled here, the model (tracked as DeepSeek V4 Pro Max in the Artificial Analysis benchmarks) is covered across six API providers, and five of them cluster at the same blended price of $2.17 per 1M tokens while still offering JSON mode and function calling. That is the kind of pricing compression that changes deployment decisions, especially if you are comparing throughput, latency, and integration tradeoffs rather than just headline model quality.
More specifically, the model covered in these sources is DeepSeek V4 Pro, released on April 24, 2026 by DeepSeek as part of the DeepSeek-V4 family. It is a 1.6T-parameter Mixture-of-Experts model with 49B activated parameters, and both OpenRouter and DeepInfra describe it with a 1M-token context at the model level. DeepInfra lists it as DeepSeek-V4-Pro in the DeepSeek-V4 preview series under the MIT license, while also noting that its public endpoint uses FP4 quantization and supports JSON mode and function calling.
What makes DeepSeek V4 Pro worth evaluating is that it combines long-context design with strong reasoning and coding results while staying cheap enough to be a serious production option. In the supplied research, DeepSeek-V4-Pro-Max posts 93.5 on LiveCodeBench, 3206 on Codeforces, 80.6 on SWE-bench Verified, 83.5 on MRCR 1M, and 62.0 on CorpusQA 1M. On DeepInfra, the model is priced at $1.74 per 1M input tokens, $3.48 per 1M output tokens, and $0.145 per 1M cached tokens; on OpenRouter, the listing shows $0.435 per 1M input tokens and $0.87 per 1M output tokens. That gives buyers a real menu of tradeoffs: same core model, different economics, and very different serving performance depending on provider.
For developers and ML teams, the practical question is not whether DeepSeek V4 Pro is interesting. It is whether you want the cheapest access path, the fastest output speed, the lowest raw time to first token, or a platform setup that fits your stack. The benchmark data here makes that evaluation unusually concrete: Fireworks leads on output speed and end-to-end latency, DeepInfra ties for the lowest benchmarked price while adding cached-token pricing and private deployment support, Together.ai posts the best raw time to first token at 0.99 seconds, and OpenRouter offers another access path with much lower listed per-token rates on its model page.
DeepSeek V4 Pro is an open-weight reasoning model from DeepSeek that is now sold through a crowded provider market rather than a single gatekeeper. In the research here, six API providers are benchmarked for DeepSeek V4 Pro Max, with Fireworks, DeepInfra, Novita, DeepSeek, and SiliconFlow all at $2.17 per 1M blended tokens and Together.ai at $2.67, while OpenRouter separately lists $0.435 input / $0.87 output per 1M tokens for deepseek/deepseek-v4-pro. If you want a long-context open model for coding, reasoning, and agentic workflows, this is worth reading; if you care most about cost controls and production flexibility, DeepInfra stands out, while Fireworks, Together.ai, Novita, DeepSeek, SiliconFlow, and OpenRouter all have credible reasons to be on the shortlist.
| Best For | Provider Recommendation | Why |
|---|---|---|
| Lowest price / cost-sensitive workloads | DeepInfra | DeepInfra ties for the lowest benchmarked blended price at $2.17 per 1M tokens and is the only source here that also lists cached tokens at $0.145 per 1M. |
| Proprietary or managed model access | OpenRouter | OpenRouter exposes the model as deepseek/deepseek-v4-pro with listed pricing of $0.435 input and $0.87 output per 1M tokens, making it a distinct managed access path in the research. |
| Easiest onboarding / fastest time-to-first-call | Together.ai | Together.ai has the best recorded raw time to first token at 0.99s, ahead of Fireworks at 1.13s and DeepInfra at 1.19s. |
| RAG, document-heavy, or high-throughput use cases | DeepInfra | DeepInfra supports JSON mode, function calling, and cached-token pricing, which is directly useful for repeated-context and document-heavy workloads. |
| Maximum output speed | Fireworks | Fireworks is the clear throughput leader at 167.1 tokens/sec, versus 40.8 for Together.ai and 32.6 for DeepInfra FP4. |
| Best end-to-end latency when reasoning time matters | Fireworks | Fireworks ranks first on time to first answer token at 27.32s, far ahead of the next provider in the benchmarked set. |
| Broadest benchmarked parity at low price | Novita | Novita matches the $2.17 per 1M blended price and lands mid-pack on throughput at 35.6 tokens/sec, which is better than DeepInfra, DeepSeek, and SiliconFlow in the benchmark. |
| Direct-from-model creator access | DeepSeek | DeepSeek is one of the six benchmarked providers and matches the $2.17 per 1M blended price while offering access from the model creator itself. |
If you have ever looked at a model bill and thought, “that prompt was not that long,” this is the part that explains why it was.
A token is a chunk of text the model reads or generates. It is not the same thing as a word. Short words may be one token. Longer words, code, JSON, whitespace patterns, and weird punctuation splits can turn into more. Pricing is based on tokens, so the shape of your workload matters more than the raw character count.
| Token type | What it is | Why it matters |
|---|---|---|
| Input tokens | Tokens you send in the request prompt | Usually the biggest driver for RAG, long-context, and agent workloads |
| Output tokens | Tokens the model generates back | Expensive when you ask for long answers, code diffs, or verbose JSON |
| Cached tokens | Reused prompt tokens billed at a reduced rate when supported | Can materially cut cost for repeated context-heavy requests |
| Blended tokens | A benchmarked combination of input and output token costs | Good for provider comparison, bad for forecasting if your traffic mix is different |
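Before estimating cost, it helps to see how your actual payloads tokenize. Here is a minimal counting sketch using Hugging Face transformers; the tokenizer repo id is an assumption based on DeepInfra's DeepSeek-V4-Pro listing name, so substitute whatever tokenizer your provider documents:

```python
# Rough token counting with a Hugging Face tokenizer.
# NOTE: the repo id below is an assumption -- swap in the tokenizer
# your provider actually documents for DeepSeek V4 Pro.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Pro")

samples = {
    "short word": "cat",
    "long word": "internationalization",
    "json": '{"user_id": 12345, "roles": ["admin", "editor"]}',
}

for label, text in samples.items():
    n_tokens = len(tokenizer.encode(text))
    print(f"{label!r}: {len(text)} chars -> {n_tokens} tokens")
```

Run this on a few representative prompts and responses from your own app; the char-to-token ratio you observe is a far better forecasting input than any generic "4 characters per token" heuristic.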
DeepSeek V4 is one of those models where the “headline price” can mislead you if you do not check how the provider bills input versus output.
| Provider | Token cost profile | Advantage | Disadvantage |
|---|---|---|---|
| DeepInfra | $1.74 input / $3.48 output / $0.145 cached per 1M | Clear pricing, cached-token discount, good fit for repeated-context workloads | Output tokens are expensive enough that rambling responses will punish you |
| Fireworks | $1.74 input / $3.48 output per 1M | Same benchmarked blended price as DeepInfra, but much faster output speed | No cached-token price listed in the research, so repeated long prompts may cost more than expected |
| Novita | $1.74 input / $3.48 output per 1M | Same token economics as Fireworks and DeepInfra in the benchmarked set | No cached-token pricing surfaced in the supplied research |
| DeepSeek | $2.17 blended per 1M in Artificial Analysis | Matches the low blended benchmark price | Less transparent in the supplied sources on exact input/output split |
| SiliconFlow | $2.17 blended per 1M in Artificial Analysis | Matches the low blended benchmark price | Same issue: blended pricing is useful for ranking, not great for estimating your own bill |
| Together.ai | $2.67 blended per 1M in Artificial Analysis | Fast raw time to first token | Highest benchmarked blended price of the tracked providers |
| OpenRouter | $0.435 input / $0.87 output per 1M on listing page | By far the lowest listed per-token rates in the supplied material | OpenRouter lists “effective pricing,” so actual economics depend on routing and provider availability underneath |
A few takeaways fall out of that table:

- DeepInfra has the most practical pricing model for cost control in the supplied sources: it is the only provider here with a published cached-token rate ($0.145 per 1M).
- Fireworks, DeepInfra, and Novita are effectively tied on base token rates in the benchmark data, all at $1.74 input / $3.48 output per 1M.
- Output tokens are where teams get sloppy and bills get weird. At double the input rate, verbose answers and retry loops inflate spend faster than long prompts do.
- Blended pricing hides workload shape. A $2.17 blended figure assumes a particular input/output mix; if your traffic skews output-heavy, your effective rate will be higher.
- OpenRouter is the odd one out on price. Its listed $0.435 / $0.87 rates are "effective pricing," so what you actually pay depends on routing and the providers underneath.
- Long-context apps should care less about the headline model price and more about prompt reuse: a reusable prefix billed at DeepInfra's cached rate is 12x cheaper than standard input.
- JSON mode and function calling help cost indirectly, because structured output means fewer malformed responses and fewer paid retries.

Practical rule of thumb for DeepSeek V4 billing: estimate tokens per request, split them into standard input, cached input, and output, and apply the per-1M rates directly. Use blended prices only to rank providers, never to forecast your own bill.
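That rule of thumb is a few lines of code. A minimal estimator using the DeepInfra rates quoted above; the token volumes you feed it are your own workload assumptions:

```python
# Per-1M-token rates for DeepSeek V4 Pro on DeepInfra, from the table above.
INPUT_RATE = 1.74    # $ per 1M standard input tokens
OUTPUT_RATE = 3.48   # $ per 1M output tokens
CACHED_RATE = 0.145  # $ per 1M cached input tokens

def monthly_cost(standard_in_m: float, cached_in_m: float, out_m: float) -> float:
    """Estimate monthly spend given token volumes in millions."""
    return (standard_in_m * INPUT_RATE
            + cached_in_m * CACHED_RATE
            + out_m * OUTPUT_RATE)

# Scenario 1 below: 20M standard input, 60M cached input, 4M output.
print(f"${monthly_cost(20, 60, 4):.2f}/month")  # -> $57.42/month
```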
DeepInfra is the power-user pick for DeepSeek V4 because it combines low token pricing with machine learning infrastructure built for serious deployment. The platform runs on bare-metal infrastructure, which matters because cutting out extra virtualization layers can help with both performance consistency and cost efficiency. DeepInfra is also typically 50–80% cheaper than major cloud competitors, which makes it especially attractive for developers, high-volume API users, and cost-conscious teams trying to keep long-context or agent workloads under control. If you want an API provider that feels built for people who actually watch their token bill, this one stands out.
| Model Name | Best Use Case | Context Window | Input Price (per 1M tokens) | Output Price (per 1M tokens) |
|---|---|---|---|---|
| DeepSeek-V4-Pro | High-end reasoning, coding, and agent workflows | 1M | $1.74 | $3.48 |
Why this matters: On DeepInfra, DeepSeek V4 is priced at $1.74 per 1M input tokens and $3.48 per 1M output tokens. That gives you a very cost-efficient path for large-scale reasoning and coding workloads on a provider that also exposes cached tokens at $0.145 per 1M, which can further reduce spend when you reuse long prompts or repeated context.
If you expect heavy traffic, repeated-context prompts, or just want tighter control over serving costs, DeepInfra is one of the strongest places to run DeepSeek V4 in production. Teams comparing options across the broader model catalog often land on V4 Pro after weighing capability against price.
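Getting a first call working is short, since DeepInfra exposes an OpenAI-compatible endpoint. A sketch; the model id is an assumption based on the DeepSeek-V4-Pro listing name, so confirm it on the model page before copying:

```python
# Minimal chat completion against DeepInfra's OpenAI-compatible API.
# The model id below is an assumption -- confirm it on the model page.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain a blended token price in one sentence."},
    ],
    max_tokens=200,  # cap output tokens -- output costs 2x the input rate
)
print(resp.choices[0].message.content)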
Below are practical developer scenarios where DeepInfra is a particularly strong way to run DeepSeek-V4-Pro: not because it is uniquely the cheapest in every benchmark, but because it combines the low benchmarked price tier with clear input/output pricing, cached-token pricing at $0.145 per 1M, JSON mode, function calling, and private endpoint support.
Scenario 1: RAG support bot with a large repeated knowledge prefix
If you are building a support copilot or internal docs assistant, you often resend the same long system prompt, retrieval scaffold, and tool instructions over and over. This is exactly the kind of workload where DeepInfra’s cached-token pricing is useful.
Assumptions
| Volume | Model | Provider | Input Tokens | Output Tokens | Monthly Cost |
|---|---|---|---|---|---|
| 1,000 requests | DeepSeek-V4-Pro | DeepInfra | 20M standard input + 60M cached | 4M output | $57.42/month |
Cost breakdown: 20M standard input × $1.74 = $34.80, 60M cached input × $0.145 = $8.70, and 4M output × $3.48 = $13.92, for a total of $57.42/month.
Why DeepInfra fits: the knowledge prefix is resent on every request, so billing 60M of the 80M input tokens at the cached rate is where the savings come from, and JSON mode keeps the bot's answers structured without paid retries.
Comparison: On a provider charging standard rates for all input at $1.74 per 1M with no cached-token discount, the same workload would cost $153.12/month, so DeepInfra saves $95.70/month.
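Cached-token discounts only help if the reused prefix is actually identical across requests. Here is a sketch of how you might structure Scenario 1's prompts so the long knowledge prefix stays stable; the file name and company are hypothetical, and the exact prefix-detection behavior is the general prefix-caching pattern rather than a documented DeepInfra guarantee:

```python
# Keep the expensive, reusable context in a fixed prefix and put
# per-request material at the end, so prefix caching can apply.
# How the provider detects cacheable prefixes is an assumption here;
# the general rule is: identical leading content, every request.

KNOWLEDGE_PREFIX = (
    "You are the support assistant for ExampleCo.\n"   # hypothetical app
    "=== Product docs ===\n"
    + open("docs_snapshot.txt").read()  # long, identical on every call
)

def build_messages(user_question: str, retrieved_chunks: list[str]):
    return [
        # Stable part first: same system prompt, same docs, same order.
        {"role": "system", "content": KNOWLEDGE_PREFIX},
        # Variable part last: retrieval results and the user's question.
        {"role": "user",
         "content": "\n".join(retrieved_chunks) + "\n\nQ: " + user_question},
    ]
```

The design point is ordering: anything that changes per request (retrieval hits, the question itself) goes after the stable prefix, never inside it.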
Scenario 2: Code review and patch generation assistant
For a coding assistant that reads diffs, repository context, and issue text, then emits structured review comments or patch suggestions, DeepSeek-V4-Pro is attractive on capability alone. DeepInfra makes it easier to run that workload with predictable pricing and tool-friendly output.
Assumptions
| Volume | Model | Provider | Input Tokens | Output Tokens | Monthly Cost |
|---|---|---|---|---|---|
| 10,000 requests | DeepSeek-V4-Pro | DeepInfra | 120M input | 20M output | $278.40/month |
Cost breakdown: 120M input × $1.74 = $208.80 and 20M output × $3.48 = $69.60, for a total of $278.40/month.
Why DeepInfra fits: function calling and JSON mode let the assistant return review comments and patch suggestions in a structured shape, and the flat $1.74 / $3.48 split makes the bill easy to forecast straight from request volume.
Comparison: On Together.ai at $2.67 per 1M blended tokens, this 140M-token monthly workload would cost $373.80/month, so DeepInfra is $95.40/month cheaper.
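Since DeepInfra's endpoint supports JSON mode, you can pin review output to a machine-readable shape instead of prose, which also keeps output token counts predictable. A sketch; the model id, input file, and review schema are assumptions for illustration:

```python
# Ask for review comments as JSON so downstream tooling can parse them.
import json
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

diff_text = open("change.diff").read()  # hypothetical input file

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",      # assumption -- check the listing
    response_format={"type": "json_object"},  # JSON mode
    messages=[
        {"role": "system", "content": (
            "Review the diff. Reply with JSON of the form "
            '{"comments": [{"file": str, "line": int, '
            '"severity": "info|warn|error", "note": str}]}'
        )},
        {"role": "user", "content": diff_text},
    ],
    max_tokens=1000,  # bound the $3.48/1M output side of the bill
)
comments = json.loads(resp.choices[0].message.content)["comments"]
```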
Scenario 3: Agent workflow with persistent tool schemas and system instructions
Agent systems often pay an invisible tax: they keep resending the same policy text, tool definitions, and orchestration instructions. DeepInfra is a good fit when that repeated prompt overhead is real and not just theoretical. As DeepInfra’s role as a Hugging Face Inference Provider shows, chat completion and text generation tasks on open-weight LLMs like DeepSeek V4 are first-class workloads on the platform.
Assumptions
| Volume | Model | Provider | Input Tokens | Output Tokens | Monthly Cost |
|---|---|---|---|---|---|
| 50,000 requests | DeepSeek-V4-Pro | DeepInfra | 200M standard input + 300M cached | 50M output | $565.50/month |
Cost breakdown: 200M standard input × $1.74 = $348.00, 300M cached input × $0.145 = $43.50, and 50M output × $3.48 = $174.00, for a total of $565.50/month.
Why DeepInfra fits: the repeated policy text and tool definitions are the bulk of the input volume, so the cached-token rate directly attacks the invisible agent tax described above.
Comparison: If all 500M input tokens were billed at the standard $1.74 per 1M input rate with no cache discount, the same workload would cost $1,044.00/month, so DeepInfra saves $478.50/month.
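The "persistent tool schemas" part of this scenario maps directly onto the function-calling support DeepInfra lists for this model. A sketch of defining tools once and resending the same objects every call; the tool name and fields are hypothetical and the model id is an assumption:

```python
# Reuse one module-level tool list so every request sends identical
# schemas -- good for cache hits and for keeping the agent predictable.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_tickets",  # hypothetical internal tool
        "description": "Search the internal ticket tracker.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",  # assumption -- confirm on the listing
    tools=TOOLS,
    messages=[{"role": "user", "content": "Find open tickets about login failures."}],
)
print(resp.choices[0].message.tool_calls)  # None if the model answered directly
```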
Scenario 4: Long-context document analysis pipeline
If you are processing contracts, research bundles, policy sets, or large case files, DeepSeek V4’s long-context design is a reason to care, but DeepInfra’s pricing structure is what makes repeated production use more manageable. For workloads that include scanned documents alongside text, you can also pair the V4 Pro pipeline with multimodal options like DeepSeek-OCR, which uses DeepEncoder and DeepSeek3B-MoE-A570M to extract structure before reasoning.
Assumptions
| Volume | Model | Provider | Input Tokens | Output Tokens | Monthly Cost |
|---|---|---|---|---|---|
| 2,000 requests | DeepSeek-V4-Pro | DeepInfra | 200M input | 10M output | $382.80/month |
Cost breakdown: 200M input × $1.74 = $348.00 and 10M output × $3.48 = $34.80, for a total of $382.80/month.
Why DeepInfra fits: document pipelines are input-heavy, and the clear $1.74 input rate plus the 1M-token context window make per-document costs easy to predict without blended-rate guesswork.
Comparison: On Together.ai at $2.67 per 1M blended tokens, this 210M-token workload would cost $560.70/month, so DeepInfra is $177.90/month cheaper.
Scenario 5: Private enterprise deployment for internal engineering tools
Sometimes the key advantage is not raw throughput. It is being able to use the same model in a more controlled deployment setup while keeping pricing understandable. That is where DeepInfra’s private endpoint support becomes more relevant than a leaderboard win. Teams that need dedicated compute can also spin up GPU instances and get from idea to a GPU-powered container in under 10 seconds, which is useful when you want isolation without long provisioning cycles.
Assumptions
| Volume | Model | Provider | Input Tokens | Output Tokens | Monthly Cost |
|---|---|---|---|---|---|
| 25,000 requests | DeepSeek-V4-Pro | DeepInfra | 200M input | 37.5M output | $478.50/month |
Cost breakdown: 200M input × $1.74 = $348.00 and 37.5M output × $3.48 = $130.50, for a total of $478.50/month.
Why DeepInfra fits: private endpoint support and bare-metal infrastructure let internal tooling run in a controlled deployment at the same transparent per-token rates, with fast GPU provisioning when dedicated compute is needed.
Comparison: On Together.ai at $2.67 per 1M blended tokens, this 237.5M-token workload would cost $634.13/month, so DeepInfra is $155.63/month cheaper.
Choosing a provider for DeepSeek V4 Pro is less about finding the “best” option and more about matching provider economics to how your workload actually behaves. The model itself is strong across reasoning, coding, and long-context tasks regardless of where you run it. What differs meaningfully between providers is how you pay for that capability — and whether the platform gives you the controls to keep costs predictable as usage grows.
The two criteria that matter most in practice are token pricing structure and prompt caching. If your app resends large system prompts, tool definitions, or retrieval context repeatedly, the difference between a provider that exposes cached-token pricing and one that does not is real money, not a theoretical discount. DeepInfra’s cached rate of $0.145 per 1M tokens is the clearest lever available in this provider set for that pattern. Beyond caching, output token cost deserves more attention than it usually gets — DeepSeek V4 Pro is the kind of model people use for long code generation and multi-step reasoning, and those workloads generate output tokens fast. At $3.48 per 1M output tokens, verbosity is expensive, and providers that also support JSON mode and function calling help you avoid the retry loops that quietly inflate output counts.
DeepInfra also covers the deployment concerns that matter once you move past early testing: SOC 2 and ISO 27001 certification, private endpoint support, and bare-metal infrastructure that removes a layer of overhead. You can review the V4 Pro API reference to see how straightforward integration looks, or browse the full text generation model catalog if you want to compare DeepSeek V4 Pro against other options before committing. The pricing is transparent, the infrastructure is production-ready, and the first call is easy to make.