NVIDIA’s Nemotron 3 Super runs 120 billion parameters while activating only 12 billion per token — a ratio that makes a real difference when orchestrating multiple agents in parallel. It’s built on a novel architecture called LatentMoE, a hybrid of Mamba-2, Mixture-of-Experts, and Attention layers designed from the ground up for agentic, reasoning, and long-context workloads. The model supports a context window of up to 1 million tokens, and that’s not just a spec on paper.
On the RULER benchmark at 1 million tokens, Nemotron 3 Super scores 91.75, compared to 22.30 for GPT-OSS-120B at the same length, a gap of nearly 70 points that shows how well the architecture retains information at extreme context lengths. Beyond long-context, it ships with a configurable reasoning mode that can be toggled on or off at inference time, making it practical across both deep reasoning tasks and lighter conversational use. It was pre-trained on over 25 trillion tokens spanning code, math, science, and general knowledge, post-trained with multi-stage reinforcement learning, and released under NVIDIA’s open commercial license. You can find the full model card and endpoint details on the NVIDIA Nemotron 3 Super model page.
LatentMoE: a hybrid architecture worth understanding. Nemotron 3 Super uses an architecture NVIDIA calls LatentMoE — a combination of Mamba-2 state space layers, Mixture-of-Experts routing, and standard Attention, augmented with Multi-Token Prediction (MTP) heads. The MoE routing happens in a projected latent dimension rather than full model dimension, which improves compute efficiency per token. The MTP heads use shared weights across prediction steps, enabling native speculative decoding without a separate draft model — which has real implications for inference throughput. For latency and throughput numbers across configurations, see the Nemotron 3 Super API benchmarks.
120B total parameters, 12B active. At inference time, only 12B parameters are activated per token — keeping compute costs close to a 12B dense model while retaining the capacity of a 120B one. It was also the first model in the Nemotron 3 family pre-trained using NVFP4 precision, then released as a BF16 checkpoint. Pre-training covered 25 trillion tokens across code, math, science, and general knowledge, spanning 20 languages and 43 programming languages. If you want to understand where Nemotron 3 Super sits relative to the smaller end of the family, the Nemotron 3 Nano explainer covers the tradeoffs between model sizes well.
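The parameter figures above translate into a couple of back-of-the-envelope numbers. The sketch below uses only the 120B total, 12B active, and BF16 figures from the text; the memory estimate is an assumption-laden lower bound that counts weights only and ignores KV cache, activations, and runtime overhead.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.
TOTAL_PARAMS = 120e9       # total parameters
ACTIVE_PARAMS = 12e9       # parameters activated per token
BYTES_PER_PARAM_BF16 = 2   # BF16 stores each parameter in 16 bits


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters that participate in each token."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Weights-only footprint in decimal gigabytes (no KV cache or activations)."""
    return params * bytes_per_param / 1e9


print(f"active fraction: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.0%}")
print(f"BF16 weights:    {weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM_BF16):.0f} GB")
```

Roughly one parameter in ten is active per token, while the full BF16 checkpoint still has to be resident in memory, which is why per-token compute and hardware footprint scale so differently for MoE models.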
Long-context retrieval holds up at 1M tokens. The default context window is 256k (limited by VRAM), but the model supports up to 1M tokens. RULER benchmark scores show strong retention across the range:
| Context Length | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| RULER @ 256k | 96.30 | 96.74 | 52.30 |
| RULER @ 512k | 95.67 | 95.95 | 46.70 |
| RULER @ 1M | 91.75 | 91.33 | 22.30 |
At 1M tokens, Nemotron 3 Super edges past Qwen3.5-122B-A10B (91.75 vs. 91.33), while GPT-OSS-120B falls to 22.30, less than half its 256k score.
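To make the retention comparison concrete, here is a small sketch that computes how many RULER points each model loses between 256k and 1M tokens. The scores are copied from the table above; the `drop` helper is just illustrative arithmetic, not part of any benchmark tooling.

```python
# RULER scores copied from the table above.
RULER = {
    "Nemotron 3 Super": {"256k": 96.30, "512k": 95.67, "1M": 91.75},
    "Qwen3.5-122B-A10B": {"256k": 96.74, "512k": 95.95, "1M": 91.33},
    "GPT-OSS-120B": {"256k": 52.30, "512k": 46.70, "1M": 22.30},
}


def drop(scores: dict) -> float:
    """Points lost between the 256k and 1M context lengths."""
    return round(scores["256k"] - scores["1M"], 2)


for model, scores in RULER.items():
    print(f"{model}: -{drop(scores):.2f} points from 256k to 1M")
```

Nemotron 3 Super loses 4.55 points across that range and Qwen3.5-122B-A10B loses 5.41, while GPT-OSS-120B loses 30.00 from an already lower starting point.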
Math, code, and science benchmarks: a mixed but competitive picture. The model leads on several science and math benchmarks (HMMT Feb25, SciCode), stays competitive on coding (LiveCodeBench v5: 81.19, ahead of Qwen3.5-122B-A10B at 78.93 but behind GPT-OSS-120B at 88.00), and shows strength in multilingual software engineering (SWE-Bench Multilingual via OpenHands: 45.78 vs. GPT-OSS-120B’s 30.80). It trails on GPQA and HLE without tools, though tool-augmented scores close some of that gap. For a direct head-to-head on coding and reasoning tasks, the Nemotron 3 Nano vs GPT-OSS-20B comparison offers useful context on how the Nemotron family generally holds up against OpenAI-class models. Selected scores:
| Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| HMMT Feb25 (no tools) | 93.67 | 91.40 | 90.00 |
| GPQA (with tools) | 82.70 | — | 80.09 |
| LiveCodeBench v5 | 81.19 | 78.93 | 88.00 |
| SciCode (subtask) | 42.05 | 42.00 | 39.00 |
| SWE-Bench Multilingual | 45.78 | — | 30.80 |
Configurable reasoning and agentic-first design. The model’s thinking behavior can be toggled per request via chat template parameters (enable_thinking=True/False), with an additional low_effort mode for lighter reasoning tasks. Post-training went through three explicit stages: SFT, then RL via asynchronous GRPO across math, code, science, tool use, and multi-turn conversation, followed by RLHF for conversational quality. It’s compatible with vLLM, SGLang, TRT-LLM, and OpenCode, making it straightforward to drop into existing agent scaffolds. Hardware baseline is 8× H100-80GB, dropping to 2× B200/B300 GPUs due to higher HBM capacity on Blackwell.
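As an illustration of the per-request toggle, the sketch below builds two chat requests that differ only in `enable_thinking`. It assumes the serving stack accepts a vLLM-style `chat_template_kwargs` field in the request body; whether a particular hosted endpoint honors that field is something to confirm against its API reference. The `build_chat_request` helper and the prompts are hypothetical, and the `low_effort` mode is left out because the text does not say how it is passed.

```python
import json

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B"


def build_chat_request(prompt: str, enable_thinking: bool) -> dict:
    """Build an OpenAI-style chat payload with the reasoning toggle set.

    ASSUMPTION: the server forwards `chat_template_kwargs` to the chat
    template, as vLLM's OpenAI-compatible server does.
    """
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


# Light conversational request: skip the reasoning trace.
fast = build_chat_request("Summarize this ticket in one line.", enable_thinking=False)
# Harder request: let the model reason before answering.
deep = build_chat_request("Prove that the square root of 2 is irrational.", enable_thinking=True)

print(json.dumps(fast, indent=2))
```

With the OpenAI Python SDK, non-standard fields such as `chat_template_kwargs` are typically passed through the `extra_body` argument of `client.chat.completions.create` rather than as top-level keyword arguments.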
Nemotron 3 Super is available on DeepInfra as a public endpoint under the model ID nvidia/NVIDIA-Nemotron-3-Super-120B-A12B. Pricing is $0.10 per 1M input tokens and $0.50 per 1M output tokens — usage-based, no commitments. For a full breakdown of how that compares across the Nemotron family, the NVIDIA Nemotron API pricing guide is worth a read. If you need a dedicated setup, private endpoint deployment is also available through the DeepInfra dashboard.
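For budgeting, the usage-based prices above reduce to simple arithmetic. This is a minimal sketch using only the two quoted rates; the `estimate_cost_usd` helper and the example workload (a 200k-token document with a 2k-token answer) are hypothetical.

```python
# Prices quoted above, in USD per 1M tokens.
INPUT_USD_PER_M = 0.10
OUTPUT_USD_PER_M = 0.50


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated request cost in USD at the quoted per-token rates."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical workload: one long document in, one short answer out.
print(f"${estimate_cost_usd(200_000, 2_000):.4f}")
```

Because output tokens cost five times as much as input tokens, long-context reads are comparatively cheap; verbose generations and reasoning traces are what move the bill.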
The API is OpenAI-compatible — swap your base URL, point to the model, and your existing SDK code works as-is. DeepInfra operates with a zero-retention policy and holds both SOC 2 and ISO 27001 certifications. The API reference for Nemotron 3 Super covers authentication, request parameters, and response schema.
Here’s a minimal example to get your first completion:
```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    "messages": [
      {
        "role": "user",
        "content": "Explain the difference between MoE and dense transformer architectures."
      }
    ]
  }'
```

```python
import os

from openai import OpenAI

# Read the token from the environment; a literal "$DEEPINFRA_TOKEN" string
# is not expanded by Python.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between MoE and dense transformer architectures.",
        }
    ],
)

print(response.choices[0].message.content)
```

```javascript
import OpenAI from "openai";

// Read the token from the environment rather than hard-coding it.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});

const response = await openai.chat.completions.create({
  model: "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
  messages: [
    {
      role: "user",
      content: "Explain the difference between MoE and dense transformer architectures.",
    },
  ],
});

console.log(response.choices[0].message.content);
```

The model supports tool/function calling on the same endpoint. If you want to explore the broader set of models available, including other members of the Nemotron family, the DeepInfra models page is a good starting point. To grab your API key and get started, head to the Nemotron 3 Super model page.
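Since tool calling is mentioned above, here is a rough offline sketch of the client side of that loop, assuming the endpoint follows the standard OpenAI chat-completions tool format. The `get_weather` tool, the `REGISTRY` mapping, and the simulated tool call are all hypothetical; in a real request, `TOOLS` would be passed as the `tools` argument and the tool calls would come back on the response message.

```python
import json


# Hypothetical tool: a stub standing in for a real weather lookup.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"


# Tool schema in the OpenAI chat-completions `tools` format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and build the `tool` message to send back."""
    fn = REGISTRY[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": fn(**args)}


# Simulated tool call, shaped like one entry of message.tool_calls.
simulated = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
}
print(run_tool_call(simulated))
```

The SDK returns tool calls as objects rather than plain dicts, so in practice the same fields are read via attribute access. The resulting `tool` message is appended to the conversation and sent back so the model can produce its final answer.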
Nemotron 3 Super is a 120B-parameter model that runs at 12B active cost, holds together at 1M-token context lengths where GPT-OSS-120B does not, and ships with tool calling and per-request reasoning control built in. That combination of long-context reliability, configurable reasoning, and native tool use makes it a practical foundation for multi-agent pipelines, complex document workflows, and inference-time compute budgeting at scale.
If you’re building systems where context depth and per-token cost need to coexist without compromise, it’s worth evaluating. The Nemotron 3 Super release post has additional background on design decisions and intended use cases. To get started, visit the model page on DeepInfra.