DeepSeek V4 Pro is a Mixture-of-Experts (MoE) language model with 1.6 trillion total parameters and 49 billion activated parameters, supporting a 1 million token context window. Designed for advanced reasoning, coding, and long-horizon agent workflows, it represents the fourth generation of DeepSeek’s flagship open-weight models.
The model introduces a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) — an architectural innovation that makes long-context inference dramatically more efficient. At 1M-token context, DeepSeek V4 Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2.
DeepSeek V4 Pro (Max) is the maximum reasoning effort mode. It uses extended chain-of-thought reasoning before generating an answer, which makes provider selection critical: time to first token and time to first answer token behave differently here than in standard generation models, and the gap between them can be significant. The model is pre-trained on more than 32 trillion tokens, uses the Muon optimizer for training stability, and is released under the MIT license.
DeepSeek V4 Pro is now available across multiple inference providers. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Best For | Provider | Speed (t/s) | TTFT (s) | Blended ($/1M) | Context | JSON | Func | Why Notable |
|---|---|---|---|---|---|---|---|---|
| Overall Recommendation | DeepInfra (FP4) | 33 | 1.19s | $2.17 | 66k | Yes | Yes | Tied-lowest price; top-3 TTFT; FP4 quantization for efficient, stable inference |
| Best raw throughput | Fireworks | 167.1 | 1.13s | $2.17 | 1M | Yes | Yes | 5x faster generation than any other provider; lowest time to first answer token (27.32s) |
| Lowest TTFT | Together.ai | 40.8 | 0.99s | $2.67 | 512k | Yes | Yes | Only sub-second TTFT; 1.2x price premium over the $2.17 tier |
| Balanced mid-tier | Novita | 35.6 | 2.07s | $2.17 | 1M | Yes | Yes | Tied-lowest price; full 1M context; slightly higher latency |
| Official baseline | DeepSeek | 34.6 | 1.85s | $2.17 | 1M | Yes | Yes | Direct provider access; 128.46s time to first answer token |
| Reliable fallback | SiliconFlow | 35.2 | 1.97s | $2.17 | 1M | Yes | Yes | Specs mirror the official DeepSeek API; solid routing fallback |
Based on benchmarks across 6 tracked providers, DeepInfra is the recommended API for production DeepSeek V4 Pro (Max) deployment. It matches the lowest available blended price ($2.17/1M), delivers a top-3 time to first token (1.19s), and uses FP4 quantization for efficient, stable inference under sustained load. For applications requiring maximum raw generation speed, Fireworks leads at 167.1 t/s with the lowest time to first answer token (27.32s), at the same $2.17 price point. For sub-second initial latency, Together.ai is the only option, at a 1.2x price premium.
DeepInfra is the recommended API provider for DeepSeek V4 Pro (Max), offering the best balance of cost, latency, and production stability across all 6 benchmarked providers.
Five of six providers converge on the same $2.17 blended price, which means the meaningful differentiation comes from latency, throughput, context window, and infrastructure reliability. DeepInfra’s FP4 quantization reduces memory bandwidth bottlenecks, which translates to more consistent performance under concurrent production load. While its 66k context window is smaller than the 1M offered by other providers, it is sufficient for the majority of agentic and reasoning workloads — and for those that require full 1M context, Fireworks or Novita are the natural alternatives at the same price tier.
Start using DeepSeek V4 Pro on DeepInfra →
Fireworks is the clear throughput leader, clocking 167.1 t/s — roughly 5x faster than any other provider in this benchmark. It also posts the lowest time to first answer token at 27.32s, which matters significantly for reasoning models where the gap between first token and first answer token can stretch to well over 100 seconds elsewhere. At the same $2.17 blended price as DeepInfra, Fireworks is the natural choice when generation speed is the primary constraint and the full 1M context window is required.
Together.ai is the only provider to break the sub-second barrier on time to first token (0.99s), making it the right choice when initial responsiveness is a strict SLA requirement. Its output speed of 40.8 t/s is respectable. The trade-off is price: at $2.67/1M blended, it costs roughly 1.2x more than the $2.17 tier, and its context window is capped at 512k rather than the full 1M available elsewhere.
Novita matches the lowest available price and offers the full 1M context window, making it a strong option when maximum context is needed but throughput and latency are not primary concerns. Its 2.07s TTFT is noticeably slower than DeepInfra and Together.ai, but its output speed of 35.6 t/s is adequate for standard workloads. It works well as a secondary provider in intelligent routing setups.
The official DeepSeek API provides the full 1M context window and matched pricing, but its time to first answer token of 128.46s is the highest in the benchmark — significantly behind Fireworks (27.32s). For reasoning-heavy workloads where the model’s thinking time directly affects user-facing latency, this is a meaningful gap. It serves as a useful baseline and is the appropriate choice for teams that specifically require direct provider access.
SiliconFlow’s specs closely mirror the official DeepSeek API — similar throughput, similar latency, same price tier, full 1M context. It is best suited as a fallback provider in intelligent routing systems, ensuring continuity if a primary provider experiences downtime without requiring any changes to application logic.
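A fallback setup like this can be sketched in a few lines. Note that `call_provider` is a hypothetical stand-in for whatever HTTP client the application actually uses, and the provider order is just one reasonable choice from this comparison:

```python
def complete_with_fallback(payload, call_provider,
                           providers=("deepinfra", "siliconflow")):
    """Try each provider in order; return the first successful result.

    `call_provider(name, payload)` is assumed to raise on network or
    HTTP errors. Because SiliconFlow mirrors the official API's specs,
    no payload translation is needed between the two.
    """
    last_err = None
    for name in providers:
        try:
            return call_provider(name, payload)
        except Exception as err:  # in practice: timeouts, 5xx responses
            last_err = err
    raise RuntimeError("all providers failed") from last_err
```

Because the payload is identical across providers, the fallback requires no changes to application logic, only a different endpoint.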
DeepSeek V4 Pro (Max) is a reasoning model, which means developers need to distinguish between two latency metrics that behave differently here than in standard generation models:

- **Time to First Token (TTFT):** the time from sending a request to receiving the very first token back, which for a reasoning model is typically the start of its chain-of-thought rather than the answer.
- **Time to First Answer Token:** the time until the model completes its internal thinking and begins generating the user-visible response.
For DeepSeek V4 Pro (Max), this gap is significant. Together.ai and DeepInfra excel on TTFT (0.99s and 1.19s respectively), but Fireworks dramatically reduces total thinking time, achieving a time to first answer token of 27.32s versus 128.46s for the official DeepSeek API. For user-facing applications, the time to first answer token is usually the number that matters.
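One way to see the difference concretely is to compute both metrics from a token stream. The sketch below assumes the reasoning phase ends with a `</think>` marker; that delimiter is an assumption (some APIs instead expose a separate reasoning field), so treat this as illustrative:

```python
# Hypothetical delimiter between chain-of-thought and answer text;
# the real marker or field name varies by provider.
THINK_END = "</think>"

def latency_metrics(events, request_start):
    """Compute (TTFT, time-to-first-answer-token) in seconds from a
    list of (timestamp, token_text) streaming events."""
    ttft = None
    first_answer = None
    seen_think_end = False
    buffer = ""
    for ts, token in events:
        if ttft is None:
            ttft = ts - request_start  # very first token of any kind
        buffer += token
        if not seen_think_end and THINK_END in buffer:
            seen_think_end = True
            continue  # the marker token itself is not answer text
        if seen_think_end and first_answer is None:
            first_answer = ts - request_start  # first visible answer token
    return ttft, first_answer
```

Run against a real stream, the two numbers would reproduce the TTFT-versus-first-answer gap discussed above.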
DeepInfra serves DeepSeek V4 Pro using FP4 quantization (4-bit floating-point), which reduces memory bandwidth requirements and enables more stable inference under concurrent load. The trade-off is context window size: DeepInfra’s context window is 66k tokens versus 1M for Fireworks, Novita, DeepSeek, and SiliconFlow. For the majority of agentic and reasoning tasks, 66k tokens is more than sufficient. Applications requiring full 1M context — such as large codebase ingestion or massive document retrieval — should route to Fireworks or Novita, both of which are at the same $2.17 blended price.
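The routing rule described above (prefer DeepInfra, spill over to a full-context provider past 66k tokens) can be sketched as follows. The limits come from the comparison table; the fallback order is illustrative:

```python
# Context limits per provider, taken from the comparison table.
# Ordered so that overflow falls through to the cheapest 1M-context tier.
CONTEXT_LIMITS = {
    "deepinfra": 66_000,     # FP4 deployment, lowest-latency default
    "fireworks": 1_000_000,  # fastest throughput, full context
    "novita": 1_000_000,     # same $2.17 price, full context
}

def pick_provider(prompt_tokens: int, preferred: str = "deepinfra") -> str:
    """Route to the preferred provider unless the prompt exceeds its
    context window; otherwise fall through to a full-context provider."""
    if prompt_tokens <= CONTEXT_LIMITS[preferred]:
        return preferred
    for name, limit in CONTEXT_LIMITS.items():
        if prompt_tokens <= limit:
            return name
    raise ValueError("prompt exceeds every provider's context window")
```

A large-codebase ingestion job at, say, 200k tokens would route past DeepInfra to a 1M-context provider automatically.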
All 6 providers support both JSON mode and function calling. This means developers can switch between providers — or implement intelligent routing across them — without rewriting application logic. For reasoning workloads where different providers may be better suited to different task types or traffic conditions, this feature parity is a meaningful operational advantage.
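Because all six providers accept OpenAI-compatible request bodies, the payload itself can stay provider-agnostic. This is a minimal sketch; the model ID is a placeholder, not a verified value:

```python
def build_request(messages, json_mode=False, tools=None):
    """Build one chat-completion payload usable with any of the six
    providers; only the base URL and API key differ per provider."""
    body = {
        "model": "deepseek-v4-pro-max",  # placeholder ID, check each provider
        "messages": messages,
    }
    if json_mode:
        # Standard OpenAI-style structured-output switch
        body["response_format"] = {"type": "json_object"}
    if tools:
        body["tools"] = tools  # function-calling definitions pass through as-is
    return body
```

Switching providers, or routing between them, then reduces to swapping the endpoint while reusing the same `build_request` output.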
What is the cheapest DeepSeek V4 Pro (Max) API provider?
Five providers are tied at $2.17/1M blended tokens: DeepInfra, Fireworks, Novita, DeepSeek, and SiliconFlow. Among these, DeepInfra (FP4) offers the best overall value when factoring in latency and infrastructure efficiency.
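As a rough cost sanity check, the blended rate can be applied to total token volume. This sketch assumes the blended figure applies uniformly to input and output tokens, whereas real billing typically prices the two sides separately:

```python
def blended_cost(input_tokens: int, output_tokens: int,
                 blended_per_million: float = 2.17) -> float:
    """Estimate request cost in dollars at a blended $/1M-token rate.

    Simplification: treats every token at the blended rate; actual
    invoices weight input and output tokens differently.
    """
    total = input_tokens + output_tokens
    return total / 1_000_000 * blended_per_million

# e.g. a 60k-token prompt plus a 4k-token reasoning-heavy response
cost = blended_cost(60_000, 4_000)  # 64k tokens at $2.17/1M
```

At these prices, even a full 66k-token DeepInfra context costs well under a dollar per request.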
Which provider has the highest context window?
Fireworks, Novita, DeepSeek, and SiliconFlow all offer the full 1M token context window. Together.ai supports 512k tokens, while DeepInfra (FP4) is limited to 66k tokens due to its FP4 quantization approach.
What is the difference between Time to First Token and Time to First Answer Token?
Time to First Token (TTFT) measures the time from sending a request to receiving the very first token back — typically the start of the model’s reasoning process. Time to First Answer Token measures the time until the model completes its internal thinking and begins generating the actual response. For reasoning models like DeepSeek V4 Pro (Max), this distinction is critical: TTFT can be under 1 second while time to first answer token can exceed 2 minutes, depending on the provider.
Which provider has the fastest output speed?
Fireworks leads at 167.1 t/s — roughly 5x faster than all other providers, which range from 33 to 41 t/s.