
DeepSeek released V4 Pro on April 24, 2026 — a 1.6 trillion-parameter Mixture of Experts model with 49 billion active parameters, a 1-million-token context window, and weights available on Hugging Face under an MIT license. On LiveCodeBench, the V4-Pro-Max reasoning variant scores 93.5 Pass@1, leading every model in the comparison set, including Gemini-3.1-Pro High at 91.7 and Claude Opus-4.6 Max at 88.8. It’s built for the workloads where reasoning depth actually matters: competitive coding, advanced math, long-context retrieval, and long-running agent tasks.
The architectural story is worth paying attention to. V4 Pro combines Compressed Sparse Attention and Heavily Compressed Attention in a hybrid design that, at 1M-token context, requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2 — a real efficiency gain at that scale. Developers get direct control over reasoning depth through three built-in modes: a fast non-thinking mode for routine tasks, a mid-tier logical analysis mode, and a full extended chain-of-thought mode (V4-Pro-Max) for maximum accuracy on hard problems. Pre-trained on more than 32 trillion tokens and post-trained through a two-stage pipeline combining SFT, RL with GRPO, and on-policy distillation, this is a serious open-weights model competing at the frontier. It’s now available to run via DeepInfra’s managed inference infrastructure.
DeepSeek V4 Pro is a 1.6-trillion-parameter Mixture of Experts model with only 49B active parameters per token. That ratio matters: you’re getting the representational capacity of a 1.6T model at the inference cost of a much smaller one. The model is trained in FP4/FP8 mixed precision (MoE expert weights run in FP4, most other parameters in FP8), which enables serving at scale without sacrificing the parameter count that drives strong benchmark results.
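To make the precision mix concrete, here’s a back-of-envelope weight-memory estimate in Python. Only the 1.6T total, FP4 for expert weights, and FP8 for the rest come from the description above; the expert/non-expert split is a made-up number for illustration.

```python
# Back-of-envelope weight memory for a 1.6T-parameter FP4/FP8 MoE model.
# ASSUMPTION: EXPERT_FRACTION is illustrative -- the real split between
# MoE expert weights and everything else isn't stated in this post.
TOTAL_PARAMS = 1.6e12     # 1.6T total parameters
EXPERT_FRACTION = 0.95    # hypothetical share of params in MoE experts
FP4_BYTES = 0.5           # FP4: 4 bits per weight
FP8_BYTES = 1.0           # FP8: 8 bits per weight

expert_params = TOTAL_PARAMS * EXPERT_FRACTION
other_params = TOTAL_PARAMS - expert_params
weight_tb = (expert_params * FP4_BYTES + other_params * FP8_BYTES) / 1e12

print(f"Approx. weight memory: {weight_tb:.2f} TB")
# ~0.84 TB at a 95% expert share, versus ~3.2 TB if every weight were FP16.
```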
The most significant architectural upgrade over DeepSeek V3.2 is the Hybrid Attention Architecture, combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). The practical upshot: at the 1M-token context length, V4 Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to V3.2. That’s a meaningful efficiency gain for long-context workloads. Two other architectural additions round this out — Manifold-Constrained Hyper-Connections (mHC) stabilize gradient flow across layers, and the Muon optimizer replaces AdamW for pre-training, improving convergence stability.
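To put those long-context numbers in absolute terms, here’s a small sketch. The 10% KV-cache and 27% FLOPs ratios are the ones quoted above; the baseline per-token KV size is a placeholder, not DeepSeek-V3.2’s published figure.

```python
# What "10% of the KV cache" means at a 1M-token context, in GB.
# ASSUMPTION: BASELINE_KV_BYTES_PER_TOKEN is a placeholder, not
# DeepSeek-V3.2's real per-token KV footprint.
CONTEXT_TOKENS = 1_000_000
BASELINE_KV_BYTES_PER_TOKEN = 70_000  # hypothetical V3.2-style baseline
KV_RATIO = 0.10     # V4 Pro needs ~10% of V3.2's KV cache (from this post)
FLOPS_RATIO = 0.27  # ...and ~27% of the single-token inference FLOPs

baseline_gb = CONTEXT_TOKENS * BASELINE_KV_BYTES_PER_TOKEN / 1e9
v4_gb = baseline_gb * KV_RATIO
print(f"KV cache at 1M tokens: {baseline_gb:.0f} GB -> {v4_gb:.0f} GB; "
      f"single-token inference FLOPs cut to {FLOPS_RATIO:.0%} of baseline")
```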
The model supports three explicit reasoning modes, giving you control over the compute/quality tradeoff at inference time:

- Non-think: a fast mode with no extended reasoning, for routine tasks where latency matters more than depth.
- Think High: a deliberate, step-by-step reasoning mode for logical analysis of mid-tier difficulty.
- Think Max (the V4-Pro-Max variant): full extended chain-of-thought for maximum accuracy on hard problems.
Benchmark performance positions V4-Pro-Max competitively against closed-source frontier models:
| Benchmark | V4-Pro-Max | GPT-5.4 xHigh | Gemini 3.1 Pro High | Opus-4.6 Max |
|---|---|---|---|---|
| LiveCodeBench (Pass@1) | 93.5 | — | 91.7 | 88.8 |
| Codeforces Rating | 3206 | 3168 | 3052 | — |
| IMOAnswerBench (Pass@1) | 89.8 | 91.4 | 81.0 | 75.3 |
| GPQA Diamond (Pass@1) | 90.1 | 93.0 | 94.3 | 91.3 |
| SWE Verified (Resolved) | 80.6 | — | 80.6 | 80.8 |
| Apex Shortlist (Pass@1) | 90.2 | — | 89.1 | 85.9 |
It leads the comparison set on coding (LiveCodeBench, Codeforces) and is effectively tied at the top on agentic tasks (SWE Verified); on math and knowledge benchmarks it’s competitive but not always first.
A few practical constraints to keep in mind: the model is text-only (no image input) — if your pipeline requires multimodal inputs, DeepInfra’s multimodal model catalog covers that use case separately. Artificial Analysis clocked output speed at 34.6 tokens/sec — slower than the comparable model median of 53.1 t/s. It also generates substantially more tokens during reasoning-heavy evaluations (~190M tokens vs. a peer average of 45M), which compounds latency on long tasks. Time to first token averages 1.85s, which is better than the 2.33s peer median. The model weights are publicly available on Hugging Face under an MIT license, so self-hosting is an option if throughput is a hard requirement — though that comes with its own infrastructure overhead.
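Those throughput figures translate directly into wall-clock time. A rough estimate using the measurements above, with the usual TTFT-plus-streaming approximation (the response lengths are hypothetical):

```python
# Rough response latency: time to first token plus streaming decode time.
# TTFT and tokens/sec are the Artificial Analysis measurements quoted above.
TTFT_S = 1.85          # time to first token, seconds
TOKENS_PER_S = 34.6    # measured output speed

def estimated_latency(output_tokens: int) -> float:
    """Approximate seconds until the full response has streamed back."""
    return TTFT_S + output_tokens / TOKENS_PER_S

# Hypothetical response sizes: short answer, long answer, heavy reasoning trace.
for n in (500, 4_000, 32_000):
    print(f"{n:>6} output tokens -> ~{estimated_latency(n):6.0f} s")
```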
DeepSeek V4 Pro is available now through DeepInfra’s text generation model catalog. Pricing is usage-based: $1.74 per 1M input tokens, $3.48 per 1M output tokens, and $0.145 per 1M cached input tokens — the cached input rate is worth keeping in mind given the model’s 1M-token context window and its natural fit for long-running agentic workloads where prompt reuse adds up fast.
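Here’s what the cached-input rate does to the math on a long agentic session, using the prices above (the token counts are hypothetical):

```python
# Cost sketch using the DeepInfra rates quoted above.
INPUT_PER_M = 1.74     # $ per 1M fresh input tokens
CACHED_PER_M = 0.145   # $ per 1M cached input tokens
OUTPUT_PER_M = 3.48    # $ per 1M output tokens

def request_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Dollar cost of a single request."""
    return (fresh_in * INPUT_PER_M + cached_in * CACHED_PER_M
            + out * OUTPUT_PER_M) / 1e6

# Hypothetical agent turn: a 400K-token context, mostly reused on later turns.
cold = request_cost(fresh_in=400_000, cached_in=0, out=8_000)
warm = request_cost(fresh_in=20_000, cached_in=380_000, out=8_000)
print(f"cold turn: ${cold:.3f}  warm turn: ${warm:.3f}")
# The cached rate is 12x cheaper than fresh input, so reuse dominates the bill.
```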
DeepInfra gives you a fully managed, OpenAI-compatible endpoint — no infrastructure to provision, no model weights to download and serve yourself. You swap in your DeepInfra token and point the base URL to https://api.deepinfra.com/v1/openai, and everything else stays the same. The model is served at FP4 precision and supports JSON output and function calling out of the box. If your workload demands dedicated compute, GPU instances are available for private deployments with no wait times or complex configuration.
DeepInfra operates with a zero-retention data policy and is SOC 2 and ISO 27001 certified.
Quickstart
Grab your API key from the DeepInfra dashboard and make your first call:
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [
{
"role": "user",
"content": "Explain the difference between CSA and HCA attention in one paragraph."
}
]
}'from openai import OpenAI
client = OpenAI(
api_key="$DEEPINFRA_TOKEN",
base_url="https://api.deepinfra.com/v1/openai",
)
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-V4-Pro",
messages=[{"role": "user", "content": "Explain the difference between CSA and HCA attention in one paragraph."}],
)
print(response.choices[0].message.content)import OpenAI from "openai";
const openai = new OpenAI({
apiKey: "$DEEPINFRA_TOKEN",
baseURL: "https://api.deepinfra.com/v1/openai",
});
const response = await openai.chat.completions.create({
model: "deepseek-ai/DeepSeek-V4-Pro",
messages: [{ role: "user", content: "Explain the difference between CSA and HCA attention in one paragraph." }],
});
console.log(response.choices[0].message.content);The model exposes three reasoning effort modes — Non-think for fast responses, Think High for deliberate step-by-step reasoning, and Think Max (the V4-Pro-Max variant) for full extended thinking via a dedicated system prompt. If you’re running the Max effort variant, expect significantly more output tokens; factor that into your cost estimates before you deploy. The DeepSeek V4 Pro API reference documents all supported parameters, including how to set reasoning modes and configure function calling.
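Function calling goes through the standard OpenAI tools schema, so the same client works unchanged. A minimal sketch; the get_weather tool and its parameters are invented for illustration:

```python
from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Illustrative tool definition in the standard OpenAI tools schema;
# "get_weather" is a made-up example, not a built-in.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# If the model decided to call the tool, the structured call lands here.
print(response.choices[0].message.tool_calls)
```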
DeepSeek V4 Pro is a capable open-weights model that holds its own at the frontier on the workloads developers actually care about — competitive coding, long-context reasoning, and multi-step agent tasks — while delivering architectural efficiency gains that make the 1M-token context window practical rather than theoretical. The MIT license means self-hosting is on the table, but the managed endpoint removes the operational overhead if you’d rather ship than serve.
Controllable reasoning depth lets you tune the compute/quality tradeoff per request instead of committing to a single behavior across your entire application — which opens up real design space for agents that need to switch between fast lookups and deep deliberation within the same workflow. Head to the DeepSeek V4 Pro model page to run it in the interactive demo, or grab your API key and start building.