
DeepSeek V4 Pro (Max) API Benchmarks: Latency, Throughput & Cost Analysis
Published on 2026.04.30 by DeepInfra

About DeepSeek V4 Pro

DeepSeek V4 Pro is a Mixture-of-Experts (MoE) language model with 1.6 trillion total parameters and 49 billion activated parameters, supporting a 1 million token context window. Designed for advanced reasoning, coding, and long-horizon agent workflows, it represents the fourth generation of DeepSeek’s flagship open-weight models.

The model introduces a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) — an architectural innovation that makes long-context inference dramatically more efficient. At 1M-token context, DeepSeek V4 Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2.

DeepSeek V4 Pro (Max) is the maximum reasoning effort mode. It uses extended chain-of-thought reasoning before generating an answer, which makes provider selection critical: time to first token and time to first answer token behave differently here than in standard generation models, and the gap between them can be significant. The model is pre-trained on more than 32 trillion tokens, uses the Muon optimizer for training stability, and is released under the MIT license.

DeepSeek V4 Pro is now available across multiple inference providers. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

DeepSeek V4 Pro (Max) API Review Summary

  • 6 API providers benchmarked: Fireworks, DeepInfra (FP4), Novita, Together.ai, DeepSeek, and SiliconFlow
  • Benchmarks are median (P50) over the past 72 hours, with the default workload set to 10,000 input tokens
  • Fastest output speed: Fireworks at 167.1 t/s — significantly ahead of all other providers
  • Lowest latency (TTFT): Together.ai at 0.99s, the only provider under one second
  • Lowest blended price: DeepInfra (FP4), Fireworks, Novita, DeepSeek, and SiliconFlow all tied at $2.17/1M tokens
  • Input/Output pricing: $1.74 / $3.48 per 1M tokens for DeepInfra, Fireworks, and Novita (exact split not published for DeepSeek and SiliconFlow); the blended figure works out as sketched after this list
  • Price dispersion: Up to 1.2x across providers — $2.17 vs $2.67 blended
  • All 6 providers support JSON mode and function calling
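
The $2.17 blended figure is consistent with the published $1.74 / $3.48 split under the common 3:1 input:output token weighting. A quick sanity check (the 3:1 ratio is an assumption; the article does not state the weighting):

```python
# Blended-price sanity check, assuming a 3:1 input:output token
# weighting (an assumption; the weighting is not stated above).
input_price = 1.74   # $ per 1M input tokens
output_price = 3.48  # $ per 1M output tokens

blended = (3 * input_price + output_price) / 4
print(blended)  # ~2.175, listed as $2.17/1M
```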

DeepSeek V4 Pro (Max) — Best APIs

| Best For | Provider | Speed (t/s) | TTFT (s) | Blended ($/1M) | Context | JSON | Func | Why Notable |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Overall Recommendation | DeepInfra (FP4) | 33 | 1.19 | $2.17 | 66k | Yes | Yes | Tied-lowest price; top-3 TTFT; FP4 quantization for efficient, stable inference |
| Best raw throughput | Fireworks | 167.1 | 1.13 | $2.17 | 1M | Yes | Yes | 4–5x faster generation than the rest of the field; lowest time to first answer token (27.32s) |
| Lowest TTFT | Together.ai | 40.8 | 0.99 | $2.67 | 512k | Yes | Yes | Only sub-second TTFT; 1.2x price premium over the $2.17 tier |
| Balanced mid-tier | Novita | 35.6 | 2.07 | $2.17 | 1M | Yes | Yes | Tied-lowest price; full 1M context; slightly higher latency |
| Official baseline | DeepSeek | 34.6 | 1.85 | $2.17 | 1M | Yes | Yes | Direct provider access; 128.46s time to first answer token |
| Reliable fallback | SiliconFlow | 35.2 | 1.97 | $2.17 | 1M | Yes | Yes | Specs mirror the official DeepSeek API; solid routing fallback |

Quick Verdict: Which DeepSeek V4 Pro Provider is Best?

Based on benchmarks across 6 tracked providers, DeepInfra is the recommended API for production DeepSeek V4 Pro (Max) deployment. It matches the lowest available blended price ($2.17/1M), delivers a top-3 time to first token (1.19s), and uses FP4 quantization for efficient, stable inference under sustained load. For applications requiring maximum raw generation speed, Fireworks leads at 167.1 t/s with the lowest time to first answer token (27.32s), at the same $2.17 price point. For sub-second initial latency, Together.ai is the only option, at a 1.2x price premium.

Overall Recommendation: DeepInfra (FP4)

DeepInfra is the recommended API provider for DeepSeek V4 Pro (Max), offering the best balance of cost, latency, and production stability across all 6 benchmarked providers.

  • Output Speed: 33 t/s
  • Time to First Token: 1.19s (top 3)
  • Blended Price: $2.17 / 1M tokens (tied lowest)
  • Input Price: $1.74 / 1M tokens
  • Output Price: $3.48 / 1M tokens
  • Context Window: 66k tokens
  • API Features: JSON Mode + Function Calling — both supported

Five of six providers converge on the same $2.17 blended price, which means the meaningful differentiation comes from latency, throughput, context window, and infrastructure reliability. DeepInfra’s FP4 quantization reduces memory bandwidth bottlenecks, which translates to more consistent performance under concurrent production load. While its 66k context window is smaller than the 1M offered by other providers, it is sufficient for the majority of agentic and reasoning workloads — and for those that require full 1M context, Fireworks or Novita are the natural alternatives at the same price tier.
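
As a starting point, here is a minimal request sketch against DeepInfra's OpenAI-compatible endpoint. The model identifier is an assumption for illustration; confirm the exact ID on the DeepInfra model page.

```python
# Minimal sketch: calling the model through DeepInfra's
# OpenAI-compatible endpoint. The model ID is illustrative.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_KEY"],
    base_url="https://api.deepinfra.com/v1/openai",
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",  # illustrative model ID
    messages=[{"role": "user", "content": "Plan a database migration in five steps."}],
)
print(resp.choices[0].message.content)
```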

Start using DeepSeek V4 Pro on DeepInfra →

Provider Analyses

1. Fireworks — Best for Raw Output Throughput

  • Output Speed: 167.1 t/s
  • Time to First Token: 1.13s
  • Time to First Answer Token: 27.32s
  • Blended Price: $2.17 / 1M tokens
  • Input / Output Price: $1.74 / $3.48 per 1M tokens
  • Context Window: 1M tokens
  • API Features: JSON Mode, Function Calling

Fireworks is the clear throughput leader, clocking 167.1 t/s — roughly 4x the next-fastest provider (Together.ai at 40.8 t/s) and about 5x the rest of the field. It also posts the lowest time to first answer token at 27.32s, which matters significantly for reasoning models where the gap between first token and first answer token can stretch to well over 100 seconds elsewhere. At the same $2.17 blended price as DeepInfra, Fireworks is the natural choice when generation speed is the primary constraint and the full 1M context window is required.

2. Together.ai — Best for Lowest Initial Latency

  • Output Speed: 40.8 t/s
  • Time to First Token: 0.99s
  • Blended Price: $2.67 / 1M tokens
  • Context Window: 512k tokens
  • API Features: JSON Mode, Function Calling

Together.ai is the only provider to break the sub-second barrier on time to first token (0.99s), making it the right choice when initial responsiveness is a strict SLA requirement. Its output speed of 40.8 t/s is respectable, second only to Fireworks. The trade-off is price: at $2.67/1M blended, it costs roughly 1.2x more than the $2.17 tier, and its context window is capped at 512k rather than the full 1M available elsewhere.

3. Novita — Balanced Low-Cost Alternative

  • Output Speed: 35.6 t/s
  • Time to First Token: 2.07s
  • Blended Price: $2.17 / 1M tokens
  • Input / Output Price: $1.74 / $3.48 per 1M tokens
  • Context Window: 1M tokens
  • API Features: JSON Mode, Function Calling

Novita matches the lowest available price and offers the full 1M context window, making it a strong option when maximum context is needed but throughput and latency are not primary concerns. Its 2.07s TTFT is noticeably slower than DeepInfra and Together.ai, but its output speed of 35.6 t/s is adequate for standard workloads. It works well as a secondary provider in intelligent routing setups.

4. DeepSeek (Official API) — The Baseline

  • Output Speed: 34.6 t/s
  • Time to First Token: 1.85s
  • Time to First Answer Token: 128.46s
  • Blended Price: $2.17 / 1M tokens
  • Context Window: 1M tokens
  • API Features: JSON Mode, Function Calling

The official DeepSeek API provides the full 1M context window and matched pricing, but its time to first answer token of 128.46s is the highest in the benchmark — significantly behind Fireworks (27.32s). For reasoning-heavy workloads where the model’s thinking time directly affects user-facing latency, this is a meaningful gap. It serves as a useful baseline and is the appropriate choice for teams that specifically require direct provider access.

5. SiliconFlow — Reliable Fallback

  • Output Speed: 35.2 t/s
  • Time to First Token: 1.97s
  • Blended Price: $2.17 / 1M tokens
  • Context Window: 1M tokens
  • API Features: JSON Mode, Function Calling

SiliconFlow’s specs closely mirror the official DeepSeek API — similar throughput, similar latency, same price tier, full 1M context. It is best suited as a fallback provider in intelligent routing systems, ensuring continuity if a primary provider experiences downtime without requiring any changes to application logic.

Technical Deep-Dive: What Developers Need to Know

1. The “Thinking Time” Gap in Reasoning Models

DeepSeek V4 Pro (Max) is a reasoning model, which means developers need to distinguish between two latency metrics that behave differently here than in standard generation models:

  • Time to First Token (TTFT): The time from sending a request to receiving the first token back — typically the start of the model’s internal reasoning process.
  • Time to First Answer Token: The time until the model finishes “thinking” and begins generating the actual response.

For DeepSeek V4 Pro (Max), this gap is significant. Together.ai and DeepInfra excel on TTFT (0.99s and 1.19s respectively), but Fireworks dramatically reduces total thinking time, achieving a time to first answer token of 27.32s versus 128.46s for the official DeepSeek API. For user-facing applications, the time to first answer token is usually the number that matters.
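
One way to see the gap directly is to time a streaming request and record when the first delta of any kind arrives versus when the first answer delta arrives. A sketch, assuming the provider separates thinking into a reasoning_content delta field (DeepSeek's API convention; other providers may expose reasoning differently) and an illustrative model ID:

```python
# Sketch: TTFT vs. time to first answer token on a streaming response.
# Assumes thinking arrives as `reasoning_content` deltas and the answer
# as `content` deltas (DeepSeek's convention; providers may differ).
import os
import time
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_KEY"],
    base_url="https://api.deepinfra.com/v1/openai",
)

start = time.monotonic()
ttft = first_answer = None

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",  # illustrative model ID
    messages=[{"role": "user", "content": "Is 1009 prime? Think it through."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue  # e.g. trailing usage-only chunks
    delta = chunk.choices[0].delta
    if ttft is None:
        ttft = time.monotonic() - start           # first token of any kind
    if first_answer is None and getattr(delta, "content", None):
        first_answer = time.monotonic() - start   # first non-thinking token
        break

print(f"TTFT: {ttft:.2f}s | first answer token: {first_answer:.2f}s")
```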

2. FP4 Quantization and Context Window Trade-offs

DeepInfra serves DeepSeek V4 Pro using FP4 quantization (4-bit floating-point), which reduces memory bandwidth requirements and enables more stable inference under concurrent load. The trade-off is context window size: DeepInfra’s context window is 66k tokens versus 1M for Fireworks, Novita, DeepSeek, and SiliconFlow. For the majority of agentic and reasoning tasks, 66k tokens is more than sufficient. Applications requiring full 1M context — such as large codebase ingestion or massive document retrieval — should route to Fireworks or Novita, both of which are at the same $2.17 blended price.
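
A minimal routing sketch under these constraints. The client-side token estimate and the Fireworks base URL are assumptions to verify against provider documentation:

```python
# Sketch: route by prompt size. Requests that fit DeepInfra's 66k window
# stay on the FP4 deployment; larger ones go to a 1M-context provider at
# the same $2.17 tier. The Fireworks URL is an assumption to verify.
DEEPINFRA_CONTEXT = 66_000  # tokens

def pick_base_url(prompt_tokens: int, max_output_tokens: int = 8_000) -> str:
    """Choose an endpoint based on the estimated total token budget."""
    if prompt_tokens + max_output_tokens <= DEEPINFRA_CONTEXT:
        return "https://api.deepinfra.com/v1/openai"
    return "https://api.fireworks.ai/inference/v1"  # full 1M context

# e.g. pick_base_url(40_000) -> DeepInfra; pick_base_url(200_000) -> Fireworks
```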

3. API Feature Parity Across All Providers

All 6 providers support both JSON mode and function calling. This means developers can switch between providers — or implement intelligent routing across them — without rewriting application logic. For reasoning workloads where different providers may be better suited to different task types or traffic conditions, this feature parity is a meaningful operational advantage.
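
That parity makes a simple failover wrapper practical. A sketch, assuming OpenAI-compatible endpoints everywhere; the non-DeepInfra base URLs and the shared model ID are assumptions, and a real deployment should map per-provider model IDs:

```python
# Sketch: provider failover enabled by JSON-mode/function-calling parity.
# Non-DeepInfra URLs and the model ID are assumptions to verify.
import os
from openai import OpenAI

PROVIDERS = [
    ("https://api.deepinfra.com/v1/openai", "DEEPINFRA_API_KEY"),
    ("https://api.fireworks.ai/inference/v1", "FIREWORKS_API_KEY"),
    ("https://api.together.xyz/v1", "TOGETHER_API_KEY"),
]

def complete_with_failover(messages, **kwargs):
    """Try each provider in order; return the first successful response."""
    last_err = None
    for base_url, key_env in PROVIDERS:
        try:
            client = OpenAI(api_key=os.environ[key_env], base_url=base_url)
            return client.chat.completions.create(
                model="deepseek-ai/DeepSeek-V4-Pro",  # illustrative; IDs may differ per provider
                messages=messages,
                response_format={"type": "json_object"},  # JSON mode works on all six
                **kwargs,
            )
        except Exception as err:  # production code should catch specific API errors
            last_err = err
    raise last_err
```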

FAQ

What is the cheapest DeepSeek V4 Pro (Max) API provider?

Five providers are tied at $2.17/1M blended tokens: DeepInfra, Fireworks, Novita, DeepSeek, and SiliconFlow. Among these, DeepInfra (FP4) offers the best overall value when factoring in latency and infrastructure efficiency.

Which provider has the highest context window?

Fireworks, Novita, DeepSeek, and SiliconFlow all offer the full 1M token context window. Together.ai supports 512k tokens, while DeepInfra (FP4) is limited to 66k tokens due to its FP4 quantization approach.

What is the difference between Time to First Token and Time to First Answer Token?

Time to First Token (TTFT) measures the time from sending a request to receiving the very first token back — typically the start of the model’s reasoning process. Time to First Answer Token measures the time until the model completes its internal thinking and begins generating the actual response. For reasoning models like DeepSeek V4 Pro (Max), this distinction is critical: TTFT can be under 1 second while time to first answer token can exceed 2 minutes, depending on the provider.

Which provider has the fastest output speed?

Fireworks leads at 167.1 t/s, roughly 4–5x faster than all other providers, which range from 33 to 41 t/s.
