
DeepSeek released V4 Pro on April 24, 2026 — a 1.6 trillion-parameter Mixture of Experts model with 49 billion active parameters, a 1-million-token context window, and weights available on Hugging Face under an MIT license. On LiveCodeBench, the V4-Pro-Max reasoning variant scores 93.5 Pass@1, leading every model in the comparison set, including Gemini-3.1-Pro High at 91.7 and Claude Opus-4.6 Max at 88.8. It’s built for the workloads where reasoning depth actually matters: competitive coding, advanced math, long-context retrieval, and long-running agent tasks.
The architectural story is worth paying attention to. V4 Pro combines Compressed Sparse Attention and Heavily Compressed Attention in a hybrid design that, at 1M-token context, requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2 — a real efficiency gain at that scale. Developers get direct control over reasoning depth through three built-in modes: a fast non-thinking mode for routine tasks, a mid-tier logical analysis mode, and a full extended chain-of-thought mode (V4-Pro-Max) for maximum accuracy on hard problems. Pre-trained on more than 32 trillion tokens and post-trained through a two-stage pipeline combining SFT, RL with GRPO, and on-policy distillation, this is a serious open-weights model competing at the frontier. It’s now available to run via DeepInfra’s managed inference infrastructure.
DeepSeek V4 Pro is a 1.6-trillion-parameter Mixture of Experts model with only 49B active parameters per token. That ratio matters: you’re getting the representational capacity of a 1.6T model at the inference cost of a much smaller one. The model is trained in FP4/FP8 mixed precision (MoE expert weights run in FP4, most other parameters in FP8), which enables serving at scale without sacrificing the parameter count that drives strong benchmark results.
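To make the precision mix concrete, here’s a back-of-envelope weight-memory estimate in Python. Only the 1.6T total, FP4 for expert weights, and FP8 for the rest come from the description above; the expert/non-expert split is a made-up number for illustration.

```python
# Back-of-envelope weight memory for a 1.6T-parameter FP4/FP8 MoE model.
# ASSUMPTION: EXPERT_FRACTION is illustrative -- the real split between
# MoE expert weights and everything else isn't stated in this post.
TOTAL_PARAMS = 1.6e12     # 1.6T total parameters
EXPERT_FRACTION = 0.95    # hypothetical share of params in MoE experts
FP4_BYTES = 0.5           # FP4: 4 bits per weight
FP8_BYTES = 1.0           # FP8: 8 bits per weight

expert_params = TOTAL_PARAMS * EXPERT_FRACTION
other_params = TOTAL_PARAMS - expert_params
weight_tb = (expert_params * FP4_BYTES + other_params * FP8_BYTES) / 1e12

print(f"Approx. weight memory: {weight_tb:.2f} TB")
# ~0.84 TB at a 95% expert share, versus ~3.2 TB if every weight were FP16.
```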
The most significant architectural upgrade over DeepSeek V3.2 is the Hybrid Attention Architecture, combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). The practical upshot: at the 1M-token context length, V4 Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache compared to V3.2. That’s a meaningful efficiency gain for long-context workloads. Two other architectural additions round this out — Manifold-Constrained Hyper-Connections (mHC) stabilize gradient flow across layers, and the Muon optimizer replaces AdamW for pre-training, improving convergence stability.
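To put those long-context numbers in absolute terms, here’s a small sketch. The 10% KV-cache and 27% FLOPs ratios are the ones quoted above; the baseline per-token KV size is a placeholder, not DeepSeek-V3.2’s published figure.

```python
# What "10% of the KV cache" means at a 1M-token context, in GB.
# ASSUMPTION: BASELINE_KV_BYTES_PER_TOKEN is a placeholder, not
# DeepSeek-V3.2's real per-token KV footprint.
CONTEXT_TOKENS = 1_000_000
BASELINE_KV_BYTES_PER_TOKEN = 70_000  # hypothetical V3.2-style baseline
KV_RATIO = 0.10     # V4 Pro needs ~10% of V3.2's KV cache (from this post)
FLOPS_RATIO = 0.27  # ...and ~27% of the single-token inference FLOPs

baseline_gb = CONTEXT_TOKENS * BASELINE_KV_BYTES_PER_TOKEN / 1e9
v4_gb = baseline_gb * KV_RATIO
print(f"KV cache at 1M tokens: {baseline_gb:.0f} GB -> {v4_gb:.0f} GB; "
      f"single-token inference FLOPs cut to {FLOPS_RATIO:.0%} of baseline")
```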
The model supports three explicit reasoning modes, giving you control over the compute/quality tradeoff at inference time:

- Non-think: a fast mode with no extended reasoning, for routine tasks where latency matters more than depth.
- Think High: a deliberate, step-by-step reasoning mode for logical analysis of mid-tier difficulty.
- Think Max (the V4-Pro-Max variant): full extended chain-of-thought for maximum accuracy on hard problems.
Benchmark performance positions V4-Pro-Max competitively against closed-source frontier models:
| Benchmark | V4-Pro-Max | GPT-5.4 xHigh | Gemini 3.1 Pro High | Opus-4.6 Max |
|---|---|---|---|---|
| LiveCodeBench (Pass@1) | 93.5 | — | 91.7 | 88.8 |
| Codeforces Rating | 3206 | 3168 | 3052 | — |
| IMOAnswerBench (Pass@1) | 89.8 | 91.4 | 81.0 | 75.3 |
| GPQA Diamond (Pass@1) | 90.1 | 93.0 | 94.3 | 91.3 |
| SWE Verified (Resolved) | 80.6 | — | 80.6 | 80.8 |
| Apex Shortlist (Pass@1) | 90.2 | — | 89.1 | 85.9 |
It leads the comparison set on coding (LiveCodeBench, Codeforces) and is effectively tied at the top on agentic tasks (SWE Verified); on math and knowledge benchmarks it’s competitive but not always first.
A few practical constraints to keep in mind: the model is text-only (no image input) — if your pipeline requires multimodal inputs, DeepInfra’s multimodal model catalog covers that use case separately. Artificial Analysis clocked output speed at 34.6 tokens/sec — slower than the comparable model median of 53.1 t/s. It also generates substantially more tokens during reasoning-heavy evaluations (~190M tokens vs. a peer average of 45M), which compounds latency on long tasks. Time to first token averages 1.85s, which is better than the 2.33s peer median. The model weights are publicly available on Hugging Face under an MIT license, so self-hosting is an option if throughput is a hard requirement — though that comes with its own infrastructure overhead.
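Those throughput figures translate directly into wall-clock time. A rough estimate using the measurements above, with the usual TTFT-plus-streaming approximation (the response lengths are hypothetical):

```python
# Rough response latency: time to first token plus streaming decode time.
# TTFT and tokens/sec are the Artificial Analysis measurements quoted above.
TTFT_S = 1.85          # time to first token, seconds
TOKENS_PER_S = 34.6    # measured output speed

def estimated_latency(output_tokens: int) -> float:
    """Approximate seconds until the full response has streamed back."""
    return TTFT_S + output_tokens / TOKENS_PER_S

# Hypothetical response sizes: short answer, long answer, heavy reasoning trace.
for n in (500, 4_000, 32_000):
    print(f"{n:>6} output tokens -> ~{estimated_latency(n):6.0f} s")
```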
DeepSeek V4 Pro is available now through DeepInfra’s text generation model catalog. Pricing is usage-based: $1.74 per 1M input tokens, $3.48 per 1M output tokens, and $0.145 per 1M cached input tokens — the cached input rate is worth keeping in mind given the model’s 1M-token context window and its natural fit for long-running agentic workloads where prompt reuse adds up fast.
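Here’s what the cached-input rate does to the math on a long agentic session, using the prices above (the token counts are hypothetical):

```python
# Cost sketch using the DeepInfra rates quoted above.
INPUT_PER_M = 1.74     # $ per 1M fresh input tokens
CACHED_PER_M = 0.145   # $ per 1M cached input tokens
OUTPUT_PER_M = 3.48    # $ per 1M output tokens

def request_cost(fresh_in: int, cached_in: int, out: int) -> float:
    """Dollar cost of a single request."""
    return (fresh_in * INPUT_PER_M + cached_in * CACHED_PER_M
            + out * OUTPUT_PER_M) / 1e6

# Hypothetical agent turn: a 400K-token context, mostly reused on later turns.
cold = request_cost(fresh_in=400_000, cached_in=0, out=8_000)
warm = request_cost(fresh_in=20_000, cached_in=380_000, out=8_000)
print(f"cold turn: ${cold:.3f}  warm turn: ${warm:.3f}")
# The cached rate is 12x cheaper than fresh input, so reuse dominates the bill.
```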
DeepInfra gives you a fully managed, OpenAI-compatible endpoint — no infrastructure to provision, no model weights to download and serve yourself. You swap in your DeepInfra token and point the base URL to https://api.deepinfra.com/v1/openai, and everything else stays the same. The model is served at FP4 precision and supports JSON output and function calling out of the box. If your workload demands dedicated compute, GPU instances are available for private deployments with no wait times or complex configuration.
DeepInfra operates with a zero-retention data policy and is SOC 2 and ISO 27001 certified.
Quickstart
Grab your API key from the DeepInfra dashboard and make your first call:
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Pro",
"messages": [
{
"role": "user",
"content": "Explain the difference between CSA and HCA attention in one paragraph."
}
]
}'from openai import OpenAI
client = OpenAI(
api_key="$DEEPINFRA_TOKEN",
base_url="https://api.deepinfra.com/v1/openai",
)
response = client.chat.completions.create(
model="deepseek-ai/DeepSeek-V4-Pro",
messages=[{"role": "user", "content": "Explain the difference between CSA and HCA attention in one paragraph."}],
)
print(response.choices[0].message.content)import OpenAI from "openai";
const openai = new OpenAI({
apiKey: "$DEEPINFRA_TOKEN",
baseURL: "https://api.deepinfra.com/v1/openai",
});
const response = await openai.chat.completions.create({
model: "deepseek-ai/DeepSeek-V4-Pro",
messages: [{ role: "user", content: "Explain the difference between CSA and HCA attention in one paragraph." }],
});
console.log(response.choices[0].message.content);The model exposes three reasoning effort modes — Non-think for fast responses, Think High for deliberate step-by-step reasoning, and Think Max (the V4-Pro-Max variant) for full extended thinking via a dedicated system prompt. If you’re running the Max effort variant, expect significantly more output tokens; factor that into your cost estimates before you deploy. The DeepSeek V4 Pro API reference documents all supported parameters, including how to set reasoning modes and configure function calling.
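Function calling goes through the standard OpenAI tools schema, so the same client works unchanged. A minimal sketch; the get_weather tool and its parameters are invented for illustration:

```python
from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Illustrative tool definition in the standard OpenAI tools schema;
# "get_weather" is a made-up example, not a built-in.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# If the model decided to call the tool, the structured call lands here.
print(response.choices[0].message.tool_calls)
```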
DeepSeek V4 Pro is a capable open-weights model that holds its own at the frontier on the workloads developers actually care about — competitive coding, long-context reasoning, and multi-step agent tasks — while delivering architectural efficiency gains that make the 1M-token context window practical rather than theoretical. The MIT license means self-hosting is on the table, but the managed endpoint removes the operational overhead if you’d rather ship than serve.
Controllable reasoning depth lets you tune the compute/quality tradeoff per request instead of committing to a single behavior across your entire application — which opens up real design space for agents that need to switch between fast lookups and deep deliberation within the same workflow. Head to the DeepSeek V4 Pro model page to run it in the interactive demo, or grab your API key and start building.