DeepSeek V4 Pro is a Mixture-of-Experts (MoE) language model with 1.6 trillion total parameters and 49 billion activated parameters, supporting a 1 million token context window. Designed for advanced reasoning, coding, and long-horizon agent workflows, it represents the fourth generation of DeepSeek’s flagship open-weight models.
The model introduces a hybrid attention architecture combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) — an architectural innovation that makes long-context inference dramatically more efficient. At 1M-token context, DeepSeek V4 Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2.
DeepSeek V4 Pro (Max) is the maximum reasoning effort mode. It uses extended chain-of-thought reasoning before generating an answer, which makes provider selection critical: time to first token and time to first answer token behave differently here than in standard generation models, and the gap between them can be significant. The model is pre-trained on more than 32 trillion tokens, uses the Muon optimizer for training stability, and is released under the MIT license.
DeepSeek V4 Pro is now available across multiple inference providers. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Best For | Provider | Speed (t/s) | TTFT (s) | Blended ($/1M) | Context | JSON | Func | Why Notable |
|---|---|---|---|---|---|---|---|---|
| Overall Recommendation | DeepInfra (FP4) | 33 | 1.19s | $2.17 | 66k | Yes | Yes | Tied-lowest price; top-3 TTFT; FP4 quantization for efficient, stable inference |
| Best raw throughput | Fireworks | 167.1 | 1.13s | $2.17 | 1M | Yes | Yes | 5x faster generation than any other provider; lowest time to first answer token (27.32s) |
| Lowest TTFT | Together.ai | 40.8 | 0.99s | $2.67 | 512k | Yes | Yes | Only sub-second TTFT; 1.2x price premium over the $2.17 tier |
| Balanced mid-tier | Novita | 35.6 | 2.07s | $2.17 | 1M | Yes | Yes | Tied-lowest price; full 1M context; slightly higher latency |
| Official baseline | DeepSeek | 34.6 | 1.85s | $2.17 | 1M | Yes | Yes | Direct provider access; 128.46s time to first answer token |
| Reliable fallback | SiliconFlow | 35.2 | 1.97s | $2.17 | 1M | Yes | Yes | Specs mirror the official DeepSeek API; solid routing fallback |
Based on benchmarks across 6 tracked providers, DeepInfra is the recommended API for production DeepSeek V4 Pro (Max) deployment. It matches the lowest available blended price ($2.17/1M), delivers a top-3 time to first token (1.19s), and uses FP4 quantization for efficient, stable inference under sustained load. For applications requiring maximum raw generation speed, Fireworks leads at 167.1 t/s with the lowest time to first answer token (27.32s), at the same $2.17 price point. For sub-second initial latency, Together.ai is the only option, at a 1.2x price premium.
DeepInfra is the recommended API provider for DeepSeek V4 Pro (Max), offering the best balance of cost, latency, and production stability across all 6 benchmarked providers.
Five of six providers converge on the same $2.17 blended price, which means the meaningful differentiation comes from latency, throughput, context window, and infrastructure reliability. DeepInfra’s FP4 quantization reduces memory bandwidth bottlenecks, which translates to more consistent performance under concurrent production load. While its 66k context window is smaller than the 1M offered by other providers, it is sufficient for the majority of agentic and reasoning workloads — and for those that require full 1M context, Fireworks or Novita are the natural alternatives at the same price tier.
Start using DeepSeek V4 Pro on DeepInfra →
Fireworks is the clear throughput leader, clocking 167.1 t/s — roughly 5x faster than any other provider in this benchmark. It also posts the lowest time to first answer token at 27.32s, which matters significantly for reasoning models where the gap between first token and first answer token can stretch to well over 100 seconds elsewhere. At the same $2.17 blended price as DeepInfra, Fireworks is the natural choice when generation speed is the primary constraint and the full 1M context window is required.
Together.ai is the only provider to break the sub-second barrier on time to first token (0.99s), making it the right choice when initial responsiveness is a strict SLA requirement. Its output speed of 40.8 t/s is respectable. The trade-off is price: at $2.67/1M blended, it costs roughly 1.2x more than the $2.17 tier, and its context window is capped at 512k rather than the full 1M available elsewhere.
Novita matches the lowest available price and offers the full 1M context window, making it a strong option when maximum context is needed but throughput and latency are not primary concerns. Its 2.07s TTFT is noticeably slower than DeepInfra and Together.ai, but its output speed of 35.6 t/s is adequate for standard workloads. It works well as a secondary provider in intelligent routing setups.
The official DeepSeek API provides the full 1M context window and matched pricing, but its time to first answer token of 128.46s is the highest in the benchmark — significantly behind Fireworks (27.32s). For reasoning-heavy workloads where the model’s thinking time directly affects user-facing latency, this is a meaningful gap. It serves as a useful baseline and is the appropriate choice for teams that specifically require direct provider access.
SiliconFlow’s specs closely mirror the official DeepSeek API — similar throughput, similar latency, same price tier, full 1M context. It is best suited as a fallback provider in intelligent routing systems, ensuring continuity if a primary provider experiences downtime without requiring any changes to application logic.
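A fallback setup like this can be sketched in a few lines. Note that `call_provider` is a hypothetical stand-in for whatever HTTP client the application actually uses, and the provider order is just one reasonable choice from this comparison:

```python
def complete_with_fallback(payload, call_provider,
                           providers=("deepinfra", "siliconflow")):
    """Try each provider in order; return the first successful result.

    `call_provider(name, payload)` is assumed to raise on network or
    HTTP errors. Because SiliconFlow mirrors the official API's specs,
    no payload translation is needed between the two.
    """
    last_err = None
    for name in providers:
        try:
            return call_provider(name, payload)
        except Exception as err:  # in practice: timeouts, 5xx responses
            last_err = err
    raise RuntimeError("all providers failed") from last_err
```

Because the payload is identical across providers, the fallback requires no changes to application logic, only a different endpoint.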
DeepSeek V4 Pro (Max) is a reasoning model, which means developers need to distinguish between two latency metrics that behave differently here than in standard generation models:

- **Time to First Token (TTFT):** the time from sending a request to receiving the very first token back, which for a reasoning model is typically the start of its chain-of-thought rather than the answer.
- **Time to First Answer Token:** the time until the model completes its internal thinking and begins generating the user-visible response.
For DeepSeek V4 Pro (Max), this gap is significant. Together.ai and DeepInfra excel on TTFT (0.99s and 1.19s respectively), but Fireworks dramatically reduces total thinking time, achieving a time to first answer token of 27.32s versus 128.46s for the official DeepSeek API. For user-facing applications, the time to first answer token is usually the number that matters.
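One way to see the difference concretely is to compute both metrics from a token stream. The sketch below assumes the reasoning phase ends with a `</think>` marker; that delimiter is an assumption (some APIs instead expose a separate reasoning field), so treat this as illustrative:

```python
# Hypothetical delimiter between chain-of-thought and answer text;
# the real marker or field name varies by provider.
THINK_END = "</think>"

def latency_metrics(events, request_start):
    """Compute (TTFT, time-to-first-answer-token) in seconds from a
    list of (timestamp, token_text) streaming events."""
    ttft = None
    first_answer = None
    seen_think_end = False
    buffer = ""
    for ts, token in events:
        if ttft is None:
            ttft = ts - request_start  # very first token of any kind
        buffer += token
        if not seen_think_end and THINK_END in buffer:
            seen_think_end = True
            continue  # the marker token itself is not answer text
        if seen_think_end and first_answer is None:
            first_answer = ts - request_start  # first visible answer token
    return ttft, first_answer
```

Run against a real stream, the two numbers would reproduce the TTFT-versus-first-answer gap discussed above.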
DeepInfra serves DeepSeek V4 Pro using FP4 quantization (4-bit floating-point), which reduces memory bandwidth requirements and enables more stable inference under concurrent load. The trade-off is context window size: DeepInfra’s context window is 66k tokens versus 1M for Fireworks, Novita, DeepSeek, and SiliconFlow. For the majority of agentic and reasoning tasks, 66k tokens is more than sufficient. Applications requiring full 1M context — such as large codebase ingestion or massive document retrieval — should route to Fireworks or Novita, both of which are at the same $2.17 blended price.
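The routing rule described above (prefer DeepInfra, spill over to a full-context provider past 66k tokens) can be sketched as follows. The limits come from the comparison table; the fallback order is illustrative:

```python
# Context limits per provider, taken from the comparison table.
# Ordered so that overflow falls through to the cheapest 1M-context tier.
CONTEXT_LIMITS = {
    "deepinfra": 66_000,     # FP4 deployment, lowest-latency default
    "fireworks": 1_000_000,  # fastest throughput, full context
    "novita": 1_000_000,     # same $2.17 price, full context
}

def pick_provider(prompt_tokens: int, preferred: str = "deepinfra") -> str:
    """Route to the preferred provider unless the prompt exceeds its
    context window; otherwise fall through to a full-context provider."""
    if prompt_tokens <= CONTEXT_LIMITS[preferred]:
        return preferred
    for name, limit in CONTEXT_LIMITS.items():
        if prompt_tokens <= limit:
            return name
    raise ValueError("prompt exceeds every provider's context window")
```

A large-codebase ingestion job at, say, 200k tokens would route past DeepInfra to a 1M-context provider automatically.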
All 6 providers support both JSON mode and function calling. This means developers can switch between providers — or implement intelligent routing across them — without rewriting application logic. For reasoning workloads where different providers may be better suited to different task types or traffic conditions, this feature parity is a meaningful operational advantage.
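Because all six providers accept OpenAI-compatible request bodies, the payload itself can stay provider-agnostic. This is a minimal sketch; the model ID is a placeholder, not a verified value:

```python
def build_request(messages, json_mode=False, tools=None):
    """Build one chat-completion payload usable with any of the six
    providers; only the base URL and API key differ per provider."""
    body = {
        "model": "deepseek-v4-pro-max",  # placeholder ID, check each provider
        "messages": messages,
    }
    if json_mode:
        # Standard OpenAI-style structured-output switch
        body["response_format"] = {"type": "json_object"}
    if tools:
        body["tools"] = tools  # function-calling definitions pass through as-is
    return body
```

Switching providers, or routing between them, then reduces to swapping the endpoint while reusing the same `build_request` output.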
What is the cheapest DeepSeek V4 Pro (Max) API provider?
Five providers are tied at $2.17/1M blended tokens: DeepInfra, Fireworks, Novita, DeepSeek, and SiliconFlow. Among these, DeepInfra (FP4) offers the best overall value when factoring in latency and infrastructure efficiency.
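As a rough cost sanity check, the blended rate can be applied to total token volume. This sketch assumes the blended figure applies uniformly to input and output tokens, whereas real billing typically prices the two sides separately:

```python
def blended_cost(input_tokens: int, output_tokens: int,
                 blended_per_million: float = 2.17) -> float:
    """Estimate request cost in dollars at a blended $/1M-token rate.

    Simplification: treats every token at the blended rate; actual
    invoices weight input and output tokens differently.
    """
    total = input_tokens + output_tokens
    return total / 1_000_000 * blended_per_million

# e.g. a 60k-token prompt plus a 4k-token reasoning-heavy response
cost = blended_cost(60_000, 4_000)  # 64k tokens at $2.17/1M
```

At these prices, even a full 66k-token DeepInfra context costs well under a dollar per request.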
Which provider has the highest context window?
Fireworks, Novita, DeepSeek, and SiliconFlow all offer the full 1M token context window. Together.ai supports 512k tokens, while DeepInfra (FP4) is limited to 66k tokens due to its FP4 quantization approach.
What is the difference between Time to First Token and Time to First Answer Token?
Time to First Token (TTFT) measures the time from sending a request to receiving the very first token back — typically the start of the model’s reasoning process. Time to First Answer Token measures the time until the model completes its internal thinking and begins generating the actual response. For reasoning models like DeepSeek V4 Pro (Max), this distinction is critical: TTFT can be under 1 second while time to first answer token can exceed 2 minutes, depending on the provider.
Which provider has the fastest output speed?
Fireworks leads at 167.1 t/s — roughly 5x faster than all other providers, which range from 33 to 41 t/s.