DeepSeek V3.2 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About DeepSeek V3.2

DeepSeek V3.2 is a state-of-the-art large language model that unifies conversational speed and deep reasoning in a single 685B parameter Mixture of Experts (MoE) architecture with 37B parameters activated per token. It is built around three key technical breakthroughs:

  • DeepSeek Sparse Attention (DSA): An efficient attention mechanism that substantially reduces computational complexity while preserving model performance, specifically optimized for long-context scenarios (a toy sketch of the core idea follows this list).
  • Scalable Reinforcement Learning Framework: A robust RL protocol with scaled post-training compute that enables performance comparable to GPT-5. The high-compute variant surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro.
  • Large-Scale Agentic Task Synthesis Pipeline: A novel synthesis pipeline covering 1,800+ environments and 85k+ complex instructions that integrates reasoning into tool-use scenarios, improving compliance and generalization in interactive environments.
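DSA's exact indexer and selection design are described in DeepSeek's technical materials; as rough intuition only, the toy Python sketch below (not DeepSeek's implementation, with hypothetical shapes) illustrates the general idea of letting each query attend to only its top-k keys instead of the full context:

```python
# Toy top-k sparse attention, for intuition only. This is NOT DeepSeek's
# DSA implementation; it only illustrates the idea that each query mixes
# a small subset of values instead of attending over the whole sequence.
import numpy as np

def topk_sparse_attention(q, K, V, k=64):
    """q: (d,) query; K: (n, d) keys; V: (n, d_v) values."""
    scores = K @ q / np.sqrt(q.shape[-1])    # dense scoring pass, shape (n,)
    idx = np.argpartition(scores, -k)[-k:]   # keep the k highest-scoring keys
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                             # softmax over the selected subset
    return w @ V[idx]                        # weighted sum of k values, not n

# Example: an 8k-token context, but each query mixes only 64 values.
q, K, V = np.random.randn(128), np.random.randn(8192, 128), np.random.randn(8192, 64)
out = topk_sparse_attention(q, K, V)
```

In this toy version the scoring pass is still dense; the point of DSA is to make the selection itself cheap for long contexts.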

DeepSeek V3.2 achieved gold-medal performance in the 2025 International Mathematical Olympiad (IMO) and International Olympiad in Informatics (IOI). It is also the first DeepSeek model to integrate thinking directly into tool-use, supporting tool-use in both thinking and non-thinking modes.

DeepSeek V3.2 is now available through different inference providers. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
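All of these providers expose the model behind standard chat-completion APIs, so moving between them mostly means changing a base URL and model ID. For reference, a minimal request sketch against DeepInfra's OpenAI-compatible endpoint (the model identifier below is an assumption; check your provider's model list):

```python
# Minimal chat-completion request via an OpenAI-compatible endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # DeepInfra's OpenAI-compatible endpoint
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2",  # assumed model ID; may differ per provider
    messages=[{"role": "user", "content": "Explain TTFT vs throughput in one paragraph."}],
)
print(resp.choices[0].message.content)
```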

DeepSeek V3.2 (Reasoning) API Review Summary

  • 9 API providers benchmarked: Amazon Bedrock, SiliconFlow (FP8), DeepSeek, Google Vertex, Novita, Parasail (FP8), Eigen AI, Nebius Fast, and DeepInfra.
  • Fastest output speed: Google Vertex at 199.2 t/s, followed by Nebius Fast (135.0 t/s) and Eigen AI (131.2 t/s).
  • Lowest latency: Google Vertex at 0.76s, followed by DeepInfra (0.82s) and Parasail FP8 (1.40s).
  • Lowest blended price: DeepInfra at $0.29/1M, Novita ($0.30), SiliconFlow FP8 ($0.31), DeepSeek and Parasail (both $0.32).
  • Price dispersion: Up to 3.2x across providers: DeepInfra at $0.29 vs Amazon Bedrock at $0.93.
  • Speed dispersion: Google Vertex (199.2 t/s) is 11.8x faster than Parasail FP8 (16.9 t/s).
  • API features: 8 of 9 providers support JSON mode (all except Amazon); 8 of 9 support Function Calling (all except Novita).

DeepSeek V3.2 — Best APIs

| Best For | Provider | Speed (t/s) | TTFT (s) | Blended ($/1M) | Context | Func | JSON | Why Notable |
|---|---|---|---|---|---|---|---|---|
| Overall recommendation | DeepInfra | 97.0 | 0.82 | $0.29 | 164k | Yes | Yes | 2nd-lowest TTFT; competitive throughput; lowest blended price with full API feature parity |
| Best speed + latency | Google Vertex | 199.2 | 0.76 | $0.84 | 164k | Yes | Yes | Fastest generation and lowest TTFT; best for synchronous user-facing applications |
| Lowest cost | Novita | 33.0 | 1.41 | $0.30 | 164k | No | Yes | Second-lowest blended price; lacks Function Calling, so unsuitable for agentic workflows |
| Low-cost alternative | SiliconFlow (FP8) | 42.0 | 3.06 | $0.31 | 164k | Yes | Yes | Near-lowest blended price; higher throughput than other budget options but high TTFT |
| Official baseline | DeepSeek | 34.0 | 1.88 | $0.32 | 128k | Yes | Yes | Among the lowest blended prices; restricted to 128k context vs the 164k standard |
| High throughput (#2) | Nebius Fast | 135.0 | 2.13 | $0.80 | 164k | Yes | Yes | High output speed for throughput-heavy workloads; higher price and TTFT |
| High throughput (#3) | Eigen AI | 131.2 | 1.72 | $0.91 | 131k | Yes | Yes | Very high output speed; premium price and restricted 131k context window |
| Low-latency runner-up | Parasail (FP8) | 16.9 | 1.40 | $0.32 | 164k | Yes | Yes | Low TTFT but severely throttled output (16.9 t/s); E2E up to 149s for 500 tokens |
| Enterprise platform | Amazon Bedrock | 45.1 | 1.74 | $0.93 | 128k | Yes | No | Most expensive; restricted 128k context; lacks JSON mode |

Quick Verdict: Which DeepSeek V3.2 Provider is Best?

Based on benchmarks across 9 tracked providers, DeepInfra is the recommended API for production-scale DeepSeek V3.2 deployment. It offers the lowest blended price ($0.29/1M), the second-lowest TTFT (0.82s — just 0.06s behind Google Vertex), competitive throughput (97.0 t/s), and full support for both JSON Mode and Function Calling. For maximum raw speed, Google Vertex leads at 199.2 t/s and 0.76s TTFT, but at a 2.9x price premium.

Overall Recommendation: DeepInfra

DeepInfra is the recommended API provider for DeepSeek V3.2, offering the best balance of latency, throughput, cost, and feature completeness across all 9 benchmarked providers.

  • Latency (TTFT): 0.82s (#2 — only 0.06s behind Google Vertex)
  • Output Speed: 97.0 t/s
  • Blended Price: $0.29 / 1M tokens (#1 lowest; see the calculation sketch after this list)
  • Input Price: $0.26 / 1M tokens
  • Output Price: $0.38 / 1M tokens
  • Context Window: 164k tokens
  • API Features: JSON Mode + Function Calling — both supported
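The blended figure can be reproduced from the listed per-token rates. The sketch below assumes a 3:1 input:output token mix (an assumption; the benchmark's exact weighting is not stated, but 3:1 matches the published numbers):

```python
# Reproducing the blended price from input/output rates, assuming a
# 3:1 input:output token mix (assumed weighting, not confirmed by the source).
def blended_price(input_per_m: float, output_per_m: float, ratio: float = 3.0) -> float:
    return (ratio * input_per_m + output_per_m) / (ratio + 1)

print(blended_price(0.26, 0.38))  # DeepInfra -> 0.29
print(blended_price(0.27, 0.40))  # Novita    -> 0.3025, i.e. ~$0.30
```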

The 9-provider landscape forces developers into difficult trade-offs: Novita sacrifices Function Calling for price, Amazon lacks JSON mode and charges the highest premium, Parasail’s low TTFT collapses under actual generation load, and Google Vertex charges nearly 3x more for its speed advantage. DeepInfra resolves this fragmentation — delivering the lowest blended price on the market alongside near-top latency (0.82s), sustained throughput (97 t/s), and complete API feature support.

For developers building agentic workflows, the combination of full JSON Mode and Function Calling support at the market’s lowest price point makes DeepInfra the only provider that doesn’t require compromising on either cost or capability.
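As an illustration of that combination, here is a hedged sketch of a Function Calling round-trip using the standard OpenAI-compatible tools parameter (the weather tool and model ID are hypothetical placeholders):

```python
# Function Calling sketch via the OpenAI-compatible `tools` parameter.
# The get_weather tool and the model ID are hypothetical placeholders.
import os
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai",
                api_key=os.environ["DEEPINFRA_API_KEY"])

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2",  # assumed model ID
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)
# If the model decides to call the tool, the structured call arrives here:
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```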

Best for Speed and Latency: Google Vertex

If your application relies on real-time, throughput-intensive streaming, Google Vertex outclasses the competition on raw performance metrics.

  • Latency (TTFT): 0.76s (#1 — the only provider to break sub-second)
  • Output Speed: 199.2 t/s (#1 — nearly 12x faster than the slowest provider)
  • End-to-End (500 tokens): 10.04s – 13.31s
  • Blended Price: $0.84 / 1M tokens
  • Context Window: 164k tokens
  • API Features: JSON Mode + Function Calling

Google Vertex’s end-to-end response time makes it the clear choice for synchronous, user-facing chat applications where high latency causes user drop-off. However, at $0.84/1M it is 2.9x more expensive than DeepInfra, a significant cost premium for the 0.06s TTFT and throughput advantage.

Lowest Cost: Novita

For background tasks, batch processing, or cost-sensitive workloads with minimal output requirements, Novita offers the most aggressive pricing.

  • Latency (TTFT): 1.41s (#4)
  • Output Speed: 33.0 t/s
  • Blended Price: $0.30 / 1M tokens (#2 lowest)
  • Input Price: $0.27 / 1M tokens
  • Output Price: $0.40 / 1M tokens
  • Context Window: 164k tokens
  • API Features: JSON Mode only — Function Calling not supported

Novita’s 1.41s TTFT is competitive, but output generation bottlenecks at 33 t/s. This makes it suitable for tasks that feed in large input context and produce minimal output, such as document classification or sentiment analysis (a JSON mode sketch for this pattern follows below). Developers building agentic workflows must look elsewhere, as Novita does not currently support Function Calling.
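For that extraction pattern, JSON mode alone is usually sufficient. A minimal sketch using the standard response_format parameter (the prompt and JSON keys are illustrative, and the model ID is assumed):

```python
# JSON mode sketch: constrain the model to emit syntactically valid JSON.
import json
import os
from openai import OpenAI

# Point this at any provider that supports JSON mode (8 of the 9 benchmarked).
client = OpenAI(base_url=os.environ["PROVIDER_BASE_URL"],
                api_key=os.environ["PROVIDER_API_KEY"])

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2",  # assumed model ID
    messages=[
        {"role": "system",
         "content": "Classify the document's sentiment. Reply as JSON with keys 'label' and 'confidence'."},
        {"role": "user",
         "content": "The product arrived late and the support team never answered."},
    ],
    response_format={"type": "json_object"},  # enforces valid JSON output
)
result = json.loads(resp.choices[0].message.content)
print(result["label"], result["confidence"])
```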

Low Cost Alternative: SiliconFlow (FP8)

  • Latency (TTFT): 3.06s
  • Output Speed: 42.0 t/s
  • Blended Price: $0.31 / 1M tokens
  • Context Window: 164k tokens
  • API Features: JSON Mode + Function Calling

SiliconFlow offers near-lowest blended pricing and higher throughput than other budget options, but its TTFT of 3.06s is one of the highest in the benchmark. It is a viable option for non-interactive batch workloads where cost is the primary constraint and latency is acceptable.

High-Throughput Alternatives: Nebius Fast and Eigen AI

If Google Vertex is outside your cloud ecosystem, Nebius Fast and Eigen AI offer competitive high-throughput infrastructure.

  • Nebius Fast: 135.0 t/s output, 2.13s TTFT, $0.80/1M, 164k context — full feature support.
  • Eigen AI: 131.2 t/s output, 1.72s TTFT, $0.91/1M, 131k context — full feature support.

Both providers clear the 100+ t/s threshold necessary for seamless streaming of DeepSeek’s reasoning tokens. Nebius Fast is slightly cheaper and offers a larger context window (164k vs 131k), making it the stronger choice between the two for RAG pipelines. However, both carry a significant price premium over DeepInfra without offering a compelling overall advantage.

Latency Trap: Parasail (FP8)

Parasail represents a cautionary case study in why developers must look beyond TTFT when evaluating API providers.

  • Latency (TTFT): 1.40s (#3)
  • Output Speed: 16.9 t/s (#9 — slowest in the benchmark)
  • End-to-End (500 tokens): 118.43s – 149.44s
  • Blended Price: $0.32 / 1M tokens

While Parasail’s FP8 quantization delivers a competitive 1.40s TTFT, its inference engine severely throttles during token generation. Taking up to 149 seconds to output 500 tokens makes it entirely unviable for interactive applications despite the attractive latency figure and competitive price point.
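Note that the naive estimate of TTFT + tokens/speed predicts only about 31 seconds for 500 tokens at 16.9 t/s, far below the measured 118 to 149 seconds. One plausible explanation (our assumption; the benchmark does not break this down) is that hidden reasoning tokens are also generated at the same throttled rate:

```python
# Back-of-envelope check on Parasail's measured E2E time.
ttft, speed = 1.40, 16.9           # seconds, tokens/second
e2e_measured = 149.44              # seconds for 500 visible output tokens

naive_e2e = ttft + 500 / speed                   # ~31.0s if only 500 tokens were generated
implied_total = (e2e_measured - ttft) * speed    # ~2,502 total tokens would explain the measurement
print(f"naive: {naive_e2e:.1f}s, implied total tokens: {implied_total:.0f}")
```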

Enterprise Platform: Amazon Bedrock

  • Latency (TTFT): 1.74s
  • Output Speed: 45.1 t/s
  • Blended Price: $0.93 / 1M tokens (#1 most expensive)
  • Context Window: 128k tokens (restricted vs 164k standard)
  • API Features: Function Calling only — JSON Mode not supported

Amazon Bedrock charges a 3.1x premium over Novita while offering mid-pack throughput and a restricted 128k context window. The lack of native JSON mode support adds friction for structured data extraction pipelines. It is only recommended if strict AWS IAM compliance is a hard architectural requirement.

Conclusion

For the vast majority of DeepSeek V3.2 deployments, DeepInfra is the recommended provider. It offers the market’s lowest blended price ($0.29/1M), second-lowest TTFT (0.82s), competitive throughput (97 t/s), and full JSON Mode and Function Calling support — delivering a complete, no-compromise infrastructure for production-grade reasoning model deployment.

  • Choose DeepInfra for the best overall value — lowest cost, near-top latency, and full feature support.
  • Choose Google Vertex if maximum speed and the absolute lowest TTFT are the primary requirements and cost is not a constraint.
  • Choose Novita or SiliconFlow for budget batch workloads — but note Novita lacks Function Calling and SiliconFlow has high TTFT.
  • Avoid Parasail for any interactive application — its 149s E2E time for 500 tokens is prohibitive despite the low TTFT figure.
  • Avoid Amazon Bedrock unless strict AWS compliance requirements make it unavoidable.
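To put the price dispersion in concrete monthly terms, here is a quick sketch using the benchmarked blended prices (the 1B tokens/month volume is an assumed workload, chosen purely for illustration):

```python
# Monthly spend at 1B blended tokens (1,000 x 1M), per benchmarked provider.
blended_per_m = {
    "DeepInfra": 0.29, "Novita": 0.30, "SiliconFlow (FP8)": 0.31,
    "DeepSeek": 0.32, "Parasail (FP8)": 0.32, "Nebius Fast": 0.80,
    "Google Vertex": 0.84, "Eigen AI": 0.91, "Amazon Bedrock": 0.93,
}
millions_of_tokens = 1_000  # assumed monthly volume: 1B tokens
for name, price in sorted(blended_per_m.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} ${price * millions_of_tokens:>6,.0f}/month")
```

At this volume the spread runs from $290/month on DeepInfra to $930/month on Amazon Bedrock for the same model.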

Frequently Asked Questions

Which API provider has the lowest latency for DeepSeek V3.2?

Google Vertex offers the lowest TTFT at 0.76 seconds. DeepInfra follows closely at 0.82 seconds.

What is the cheapest DeepSeek V3.2 API provider?

DeepInfra is the most cost-effective option at $0.29 per 1M blended tokens with full feature support. Novita is $0.30 but lacks Function Calling.

Does Amazon Bedrock support JSON mode for DeepSeek V3.2?

No. Amazon Bedrock’s implementation does not currently support JSON mode, though it does support Function Calling.

Which provider offers the best balance of price and performance?

DeepInfra offers the optimal balance — $0.29/1M blended tokens, 0.82s TTFT, 97 t/s output speed, and full support for both JSON Mode and Function Calling.

What is DeepSeek V3.2’s context window?

DeepSeek V3.2 supports up to 164k tokens on most providers. Amazon Bedrock and DeepSeek’s native API are restricted to 128k tokens, and Eigen AI is capped at 131k.

Does DeepSeek V3.2 support tool calling during reasoning mode?

Yes. DeepSeek V3.2 is the first DeepSeek model to integrate thinking directly into tool-use, supporting tool-use in both thinking and non-thinking modes.
