DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

DeepSeek V4 Pro is a 1.6-trillion parameter Mixture-of-Experts (MoE) model from DeepSeek, released on April 24, 2026 under the MIT license. It is designed for advanced reasoning, complex software engineering, and long-running agentic tasks, and arrives alongside DeepSeek-V4-Flash, a lighter 284B-parameter variant built for faster, lower-cost inference. The V4 series is DeepSeek’s first two-tier lineup and introduces a new architecture — the first from the lab since V3. Both models are hybrid thinking/non-thinking and support a 1 million token context window.
The V4 series is built on several technical advances over DeepSeek-V3.2:
The V4-Pro-Base model shows consistent improvements over V3.2 across standard academic benchmarks:
| Benchmark (Metric) | DeepSeek-V3.2-Base | DeepSeek-V4-Flash-Base | DeepSeek-V4-Pro-Base |
|---|---|---|---|
| MMLU (EM) | 87.8 | 88.7 | 90.1 |
| MMLU-Pro (EM) | 65.5 | 68.3 | 73.5 |
| GSM8K (8-shot) | 91.1 | 90.8 | 92.6 |
| HumanEval (Pass@1) | 62.8 | 69.5 | 76.8 |
In its maximum reasoning effort mode (V4-Pro-Max), the model competes directly with leading closed-source systems:
| Benchmark (Metric) | DS-V4-Pro Max | GPT-5.4 xHigh | Gemini-3.1-Pro High | Opus-4.6 Max |
|---|---|---|---|---|
| LiveCodeBench (Pass@1) | 93.5 | — | 91.7 | 88.8 |
| GPQA Diamond (Pass@1) | 90.1 | 93.0 | 94.3 | 91.3 |
| SWE Verified (Resolved) | 80.6 | — | 80.6 | 80.8 |
A few additional results worth noting:
DeepSeek-V4-Pro is available for immediate integration via the DeepInfra platform under the model identifier deepseek-ai/DeepSeek-V4-Pro. Access the model at deepinfra.com/deepseek-ai/DeepSeek-V4-Pro.
Reasoning Modes
A key feature of DeepSeek V4 is configurable reasoning depth. Developers can select the level of thinking effort per request, trading latency for analytical depth:
| Reasoning Mode | Characteristics | Typical Use Cases |
|---|---|---|
| Non-think | Fast, intuitive, low-latency | Routine tasks, simple chat, low-risk decisions |
| Think High | Logical analysis, moderate latency | Complex problem-solving, planning, coding |
| Think Max | Maximum reasoning depth | Hard agentic tasks, boundary-pushing logic |
Response Format
The model’s output structure changes based on the selected mode, using <think> tags to encapsulate internal chain-of-thought reasoning:
JSON output is supported across all modes. The thinking and summary content are embedded within the standard JSON response body.
DeepSeek V4 Pro is available on DeepInfra with usage-based pricing calculated per million tokens:
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $1.74 |
| Output Tokens | $3.48 |
| Cached Input Tokens | $0.145 |
A note on cost in practice: Think Max mode is token-intensive. On the Artificial Analysis Intelligence Index, V4 Pro (Max) used approximately 190M output tokens — far above the median of 47M for comparable open-weights models — bringing the total benchmark run cost to $1,071. That is still more than 4x cheaper than running the same benchmark on Claude Opus 4.7 ($4,811). For general output token pricing, the gap is larger: at $3.48/1M output tokens versus $25/1M for Claude Opus 4.7, V4 Pro is approximately 7x cheaper on output. For applications where Think Max mode generates long responses, monitoring output token usage is important.
Inference Economics: True AI Costs at Scale<p>Most teams discover their inference economics the same way: a production bill arrives that looks nothing like the number they expected. The per-token price seemed small enough during testing. Then real traffic showed up, agents started chaining calls, RAG pipelines bloated the context window, and suddenly the math looked completely different. Token prices have fallen […]</p>
GLM-4.6 vs DeepSeek-V3.2: Performance, Benchmarks & DeepInfra Results<p>The open-source LLM ecosystem has evolved rapidly, and two models stand out as leaders in capability, efficiency, and practical usability: GLM-4.6, Zhipu AI’s high-capacity reasoning model with a 200k-token context window, and DeepSeek-V3.2, a sparsely activated Mixture-of-Experts architecture engineered for exceptional performance per dollar. Both models are powerful. Both are versatile. Both are widely adopted […]</p>
Introducing NVIDIA Nemotron 3 Nano Omni on DeepInfraDeepInfra is an official launch partner for NVIDIA Nemotron 3 Nano Omni, the first multimodal model in the Nemotron 3 family — a single open model that understands images, video, audio, documents, and text in one unified inference pass.© 2026 DeepInfra. All rights reserved.