

DeepSeek V4 Pro: Model Overview, Features & Performance Guide
Published on April 30, 2026 by DeepInfra

DeepSeek V4 Pro is a 1.6-trillion-parameter Mixture-of-Experts (MoE) model from DeepSeek, released on April 24, 2026 under the MIT license. It is designed for advanced reasoning, complex software engineering, and long-running agentic tasks, and arrives alongside DeepSeek-V4-Flash, a lighter 284B-parameter variant built for faster, lower-cost inference. The V4 series is DeepSeek's first two-tier lineup and introduces a new architecture, the first from the lab since V3. Both models support hybrid thinking and non-thinking modes and a 1-million-token context window.

Architectural Innovations

The V4 series is built on several technical advances over DeepSeek-V3.2:

  • Hybrid Attention Architecture: Combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). At 1M-token context, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared to V3.2 — a meaningful efficiency gain for long-context production workloads.
  • Manifold-Constrained Hyper-Connections (mHC): Stabilizes signal propagation across the model’s deep layer stack without sacrificing expressivity.
  • Muon Optimizer: Delivers faster convergence and improved training stability across a dataset exceeding 32 trillion tokens.
  • Mixed Precision Training: MoE expert parameters use FP4 precision; most other parameters use FP8. This balance maximizes memory efficiency without compromising performance.
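The KV-cache figure above translates directly into serving capacity. A back-of-the-envelope sketch, using the quoted 10% ratio; `BYTES_PER_TOKEN_BASELINE` is a hypothetical placeholder for V3.2's per-token KV footprint, so substitute a measured value for real capacity planning:

```python
# Hypothetical baseline: assume V3.2 stores ~70 KiB of KV cache per token.
BYTES_PER_TOKEN_BASELINE = 70 * 1024

def kv_cache_gib(context_tokens: int, ratio: float = 0.10) -> float:
    """Estimated KV-cache size in GiB at `ratio` of the baseline footprint.

    ratio=0.10 reflects the article's claim that V4-Pro needs only 10% of
    V3.2's KV cache at a 1M-token context; ratio=1.0 gives the baseline.
    """
    return context_tokens * BYTES_PER_TOKEN_BASELINE * ratio / 1024**3

# kv_cache_gib(1_000_000)       -> V4-Pro estimate at the full window
# kv_cache_gib(1_000_000, 1.0)  -> V3.2 estimate, ~10x larger
```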

Performance and Benchmarks

The V4-Pro-Base model shows consistent improvements over V3.2 across standard academic benchmarks:

Benchmark (Metric)      DeepSeek-V3.2-Base   DeepSeek-V4-Flash-Base   DeepSeek-V4-Pro-Base
MMLU (EM)                     87.8                   88.7                    90.1
MMLU-Pro (EM)                 65.5                   68.3                    73.5
GSM8K (8-shot)                91.1                   90.8                    92.6
HumanEval (Pass@1)            62.8                   69.5                    76.8

In its maximum reasoning effort mode (V4-Pro-Max), the model competes directly with leading closed-source systems:

Benchmark (Metric)        DS-V4-Pro Max   GPT-5.4 xHigh   Gemini-3.1-Pro High   Opus-4.6 Max
LiveCodeBench (Pass@1)        93.5            91.7              88.8                —
GPQA Diamond (Pass@1)         90.1            93.0              94.3               91.3
SWE Verified (Resolved)       80.6            80.6              80.8                —

(— indicates a value not reported in the source data.)

A few additional results worth noting:

  • Competitive programming: V4-Pro-Max achieves a 3206 Codeforces Rating, ahead of Gemini-3.1-Pro High.
  • Agentic real-world tasks: The model leads open-weights models on the GDPval-AA benchmark with a score of 1554, ahead of Kimi K2.6 (1484), GLM-5.1 (1535), and MiniMax-M2.7 (1514). On the Artificial Analysis Intelligence Index, V4 Pro ranks #2 among open-weights reasoning models, behind only Kimi K2.6 (54 vs 52).
  • Long-context retrieval: On MRCR 1M, the model achieves 83.5, demonstrating solid retrieval accuracy across the full 1M-token window — though Claude Opus 4.6 leads on this specific benchmark.
  • Hallucination tendency: V4 Pro has a 94% hallucination rate on the AA-Omniscience benchmark, meaning when the model does not know an answer, it nearly always responds anyway rather than abstaining. This is a specific known limitation on unknown-answer tasks and is worth accounting for in production use cases where confidence calibration matters.

Getting Started with the API

DeepSeek-V4-Pro is available for immediate integration via the DeepInfra platform under the model identifier deepseek-ai/DeepSeek-V4-Pro. Access the model at deepinfra.com/deepseek-ai/DeepSeek-V4-Pro.
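DeepInfra exposes an OpenAI-compatible REST endpoint, so a request can be sketched with nothing but the Python standard library. The endpoint URL below is DeepInfra's documented OpenAI-compatible path; the helper names (`build_payload`, `ask`) are illustrative, and `DEEPINFRA_API_KEY` must be set in your environment before calling `ask`:

```python
import json
import os
import urllib.request

API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL_ID = "deepseek-ai/DeepSeek-V4-Pro"

def build_payload(prompt: str) -> dict:
    """Assemble an OpenAI-style chat-completion request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt: str) -> str:
    """Send one prompt to DeepSeek-V4-Pro and return the reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['DEEPINFRA_API_KEY']}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

In practice the official `openai` Python client works just as well: point its `base_url` at `https://api.deepinfra.com/v1/openai` and pass the same model identifier.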

Reasoning Modes

A key feature of DeepSeek V4 is configurable reasoning depth. Developers can select the level of thinking effort per request, trading latency for analytical depth:

Reasoning Mode   Characteristics                      Typical Use Cases
Non-think        Fast, intuitive, low-latency         Routine tasks, simple chat, low-risk decisions
Think High       Logical analysis, moderate latency   Complex problem-solving, planning, coding
Think Max        Maximum reasoning depth              Hard agentic tasks, boundary-pushing logic
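Selecting a mode per request might look like the sketch below. The field name `reasoning_effort` and the level strings are assumptions modeled on common OpenAI-style APIs, not confirmed names; check the model page for the parameter DeepInfra actually exposes:

```python
# Map the article's three modes to hypothetical API effort levels.
EFFORT_LEVELS = {"non-think": "none", "think-high": "high", "think-max": "max"}

def payload_with_effort(prompt: str, mode: str) -> dict:
    """Chat-completion body with a per-request reasoning-depth knob.

    `reasoning_effort` is an assumed parameter name; verify against the
    provider's documentation before relying on it.
    """
    if mode not in EFFORT_LEVELS:
        raise ValueError(f"unknown reasoning mode: {mode}")
    return {
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": EFFORT_LEVELS[mode],  # assumed field name
    }
```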

Response Format

The model’s output structure changes based on the selected mode, using <think> tags to encapsulate internal chain-of-thought reasoning:

  • Non-think: Outputs </think> [summary] — the closing tag without an opener signals that the thinking block was skipped.
  • Think High / Think Max: Outputs <think> [thinking process] </think> [summary] — the full chain-of-thought is enclosed before the final response.
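Given the tag convention above, separating the chain-of-thought from the final answer is a small string-handling task. A minimal sketch (the function name is illustrative):

```python
def split_reasoning(text: str) -> tuple:
    """Split a V4 completion into (thinking, summary).

    Handles both shapes: '<think>...</think> summary' from Think modes,
    and a bare '</think> summary' from Non-think mode.
    """
    open_tag, close_tag = "<think>", "</think>"
    if close_tag not in text:
        return "", text.strip()  # no tags at all: treat everything as summary
    head, _, tail = text.partition(close_tag)
    if head.startswith(open_tag):  # Think High / Think Max
        return head[len(open_tag):].strip(), tail.strip()
    return "", tail.strip()  # Non-think: closing tag with no opener
```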

JSON output is supported across all modes. The thinking and summary content are embedded within the standard JSON response body.

Pricing

DeepSeek V4 Pro is available on DeepInfra with usage-based pricing calculated per million tokens:

Token Type            Price per 1M Tokens
Input Tokens          $1.74
Output Tokens         $3.48
Cached Input Tokens   $0.145
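With per-million-token rates, estimating a request's cost is simple arithmetic. A sketch using the rates in the table above (the function name is illustrative):

```python
# USD per 1M tokens, from the pricing table above.
PRICE_INPUT = 1.74
PRICE_OUTPUT = 3.48
PRICE_CACHED_INPUT = 0.145

def request_cost(input_tokens: int, output_tokens: int,
                 cached_input_tokens: int = 0) -> float:
    """Estimated USD cost of one request at DeepInfra's listed rates."""
    return (
        input_tokens * PRICE_INPUT
        + output_tokens * PRICE_OUTPUT
        + cached_input_tokens * PRICE_CACHED_INPUT
    ) / 1_000_000

# Example: 12k input tokens and 4k output tokens
request_cost(12_000, 4_000)  # ≈ $0.0348
```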

A note on cost in practice: Think Max mode is token-intensive. On the Artificial Analysis Intelligence Index, V4 Pro (Max) used approximately 190M output tokens — far above the median of 47M for comparable open-weights models — bringing the total benchmark run cost to $1,071. That is still more than 4x cheaper than running the same benchmark on Claude Opus 4.7 ($4,811). For general output token pricing, the gap is larger: at $3.48/1M output tokens versus $25/1M for Claude Opus 4.7, V4 Pro is approximately 7x cheaper on output. For applications where Think Max mode generates long responses, monitoring output token usage is important.

Next Steps for Developers

  • Explore the API and model page at deepinfra.com/deepseek-ai/DeepSeek-V4-Pro
  • Download model weights for self-hosting from Hugging Face or ModelScope — both base and instruct variants are available under the MIT license.
  • Review the DeepInfra Pricing Page for current rates and any tier-specific details.
  • For authentication and private endpoint setup, refer to the DeepInfra Dashboard.