NVIDIA Nemotron 3 Super on DeepInfra: 120B MoE Model

Published on 2026.05.25 by DeepInfra

NVIDIA’s Nemotron 3 Super has 120 billion parameters but activates only 12 billion per token, a 10:1 ratio that keeps per-token compute low when many agents run in parallel. It’s built on a novel architecture called LatentMoE, a hybrid of Mamba-2, Mixture-of-Experts, and Attention layers designed from the ground up for agentic, reasoning, and long-context workloads. The model supports a context window of up to 1 million tokens, and the benchmark results back that figure up.

On the RULER benchmark at 1 million tokens, Nemotron 3 Super scores 91.75, compared to 22.30 for GPT-OSS-120B at the same length, a gap that shows how well the architecture retains information at extreme context lengths. Beyond long-context, it ships with a configurable reasoning mode that can be toggled on or off at inference time, making it practical across both deep reasoning tasks and lighter conversational use. It was pre-trained on over 25 trillion tokens spanning code, math, science, and general knowledge, post-trained with multi-stage reinforcement learning, and released under NVIDIA’s open commercial license. You can find the full model card and endpoint details on the NVIDIA Nemotron 3 Super model page.

What Makes This Model Different

LatentMoE: a hybrid architecture worth understanding. Nemotron 3 Super uses an architecture NVIDIA calls LatentMoE — a combination of Mamba-2 state space layers, Mixture-of-Experts routing, and standard Attention, augmented with Multi-Token Prediction (MTP) heads. The MoE routing happens in a projected latent dimension rather than full model dimension, which improves compute efficiency per token. The MTP heads use shared weights across prediction steps, enabling native speculative decoding without a separate draft model — which has real implications for inference throughput. For latency and throughput numbers across configurations, see the Nemotron 3 Super API benchmarks.

120B total parameters, 12B active. At inference time, only 12B parameters are activated per token — keeping compute costs close to a 12B dense model while retaining the capacity of a 120B one. It was also the first model in the Nemotron 3 family pre-trained using NVFP4 precision, then released as a BF16 checkpoint. Pre-training covered 25 trillion tokens across code, math, science, and general knowledge, spanning 20 languages and 43 programming languages. If you want to understand where Nemotron 3 Super sits relative to the smaller end of the family, the Nemotron 3 Nano explainer covers the tradeoffs between model sizes well.

Long-context retrieval holds up at 1M tokens. The default context window is 256k (limited by VRAM), but the model supports up to 1M tokens. RULER benchmark scores show strong retention across the range:

Context Length	Nemotron 3 Super	Qwen3.5-122B-A10B	GPT-OSS-120B
RULER @ 256k	96.30	96.74	52.30
RULER @ 512k	95.67	95.95	46.70
RULER @ 1M	91.75	91.33	22.30

At 1M tokens, Nemotron 3 Super edges past Qwen3.5-122B-A10B (91.75 vs. 91.33) after trailing it slightly at 256k and 512k, while GPT-OSS-120B falls to 22.30, less than half of its 256k score.
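To make the degradation across context lengths easier to compare, here is a small sketch that computes how many points each model loses between 256k and 1M and what share of its 256k score it retains. The scores are copied from the table above; the `retention` helper and the "retained share" metric are illustrative choices, not part of the RULER benchmark.

```python
# RULER scores copied from the table above (higher is better).
ruler = {
    "Nemotron 3 Super": {"256k": 96.30, "512k": 95.67, "1M": 91.75},
    "Qwen3.5-122B-A10B": {"256k": 96.74, "512k": 95.95, "1M": 91.33},
    "GPT-OSS-120B": {"256k": 52.30, "512k": 46.70, "1M": 22.30},
}


def retention(scores: dict) -> float:
    """Percentage of the 256k score that is still present at 1M (illustrative metric)."""
    return round(100 * scores["1M"] / scores["256k"], 1)


for model, scores in ruler.items():
    drop = round(scores["256k"] - scores["1M"], 2)
    print(f"{model}: loses {drop} points, retains {retention(scores)}% of its 256k score")
```

Both Nemotron 3 Super and Qwen3.5-122B-A10B keep well over 90% of their 256k score at 1M, while GPT-OSS-120B keeps less than half.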

Math, code, and science benchmarks: a mixed but competitive picture. The model leads on several science and math benchmarks (HMMT Feb25, SciCode), stays competitive on coding (LiveCodeBench v5: 81.19, ahead of Qwen3.5-122B-A10B’s 78.93 but behind GPT-OSS-120B’s 88.00), and shows strength in multilingual software engineering (SWE-Bench Multilingual via OpenHands: 45.78 vs. GPT-OSS-120B’s 30.80). It trails on GPQA and HLE without tools, though tool-augmented scores close some of that gap. For a direct head-to-head on coding and reasoning tasks, the Nemotron 3 Nano vs GPT-OSS-20B comparison offers useful context on how the Nemotron family generally holds up against OpenAI-class models. Selected scores:

Benchmark	Nemotron 3 Super	Qwen3.5-122B-A10B	GPT-OSS-120B
HMMT Feb25 (no tools)	93.67	91.40	90.00
GPQA (with tools)	82.70	—	80.09
LiveCodeBench v5	81.19	78.93	88.00
SciCode (subtask)	42.05	42.00	39.00
SWE-Bench Multilingual	45.78	—	30.80

Configurable reasoning and agentic-first design. The model’s thinking behavior can be toggled per request via chat template parameters (enable_thinking=True/False), with an additional low_effort mode for lighter reasoning tasks. Post-training went through three explicit stages: SFT, then RL via asynchronous GRPO across math, code, science, tool use, and multi-turn conversation, followed by RLHF for conversational quality. It’s compatible with vLLM, SGLang, TRT-LLM, and OpenCode, making it straightforward to drop into existing agent scaffolds. Hardware baseline is 8× H100-80GB, dropping to 2× B200/B300 GPUs due to higher HBM capacity on Blackwell.
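As a rough illustration of the per-request toggle, the sketch below builds the keyword arguments for an OpenAI-SDK chat completion call with thinking switched on or off. It assumes the serving endpoint accepts vLLM-style `chat_template_kwargs` in the request body; whether a given provider exposes the flag this way should be checked against its API reference. `build_chat_request` is a hypothetical helper, and the `low_effort` mode is not covered here because its parameter name is not given above.

```python
from typing import Any

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B"


def build_chat_request(prompt: str, enable_thinking: bool = True) -> dict[str, Any]:
    """Build kwargs for client.chat.completions.create() (hypothetical helper).

    Assumption: the server forwards `chat_template_kwargs` to the chat
    template, as vLLM's OpenAI-compatible server does. `extra_body` is the
    OpenAI Python SDK's mechanism for sending extra request fields.
    """
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"chat_template_kwargs": {"enable_thinking": enable_thinking}},
    }


# Usage with an already configured OpenAI client (not executed here):
#   response = client.chat.completions.create(
#       **build_chat_request("Summarize this log file.", enable_thinking=False)
#   )
fast = build_chat_request("Summarize this log file.", enable_thinking=False)
print(fast["extra_body"])
```

Keeping the toggle in one helper makes it easy to route simple conversational turns with thinking off and reserve the reasoning mode for harder tasks.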

Getting Started on DeepInfra

Nemotron 3 Super is available on DeepInfra as a public endpoint under the model ID nvidia/NVIDIA-Nemotron-3-Super-120B-A12B. Pricing is $0.10 per 1M input tokens and $0.50 per 1M output tokens — usage-based, no commitments. For a full breakdown of how that compares across the Nemotron family, the NVIDIA Nemotron API pricing guide is worth a read. If you need a dedicated setup, private endpoint deployment is also available through the DeepInfra dashboard.
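For budgeting, the per-token prices above translate into a simple estimate. The sketch below uses only the two listed rates; the function name and the example token counts are made up for illustration, and actual billing may differ (for example if prices change).

```python
# Prices quoted above, in USD per 1M tokens.
INPUT_USD_PER_M = 0.10
OUTPUT_USD_PER_M = 0.50


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Hypothetical long-context request: 200k tokens in, 2k tokens out.
cost = estimate_cost_usd(200_000, 2_000)
print(f"Estimated cost: ${cost:.4f}")
```

Because output tokens cost five times as much as input tokens, long-context requests with short answers stay cheap, while reasoning-heavy requests with long outputs dominate the bill.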

The API is OpenAI-compatible — swap your base URL, point to the model, and your existing SDK code works as-is. DeepInfra operates with a zero-retention policy and holds both SOC 2 and ISO 27001 certifications. The API reference for Nemotron 3 Super covers authentication, request parameters, and response schema.

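Here’s a minimal example to get your first completion, shown with curl, Python, and JavaScript. Each one expects your API key in the DEEPINFRA_TOKEN environment variable: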

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
      "messages": [
        {
          "role": "user",
          "content": "Explain the difference between MoE and dense transformer architectures."
        }
      ]
    }'

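import os

from openai import OpenAI

# Read the API key from the environment; Python does not expand a literal "$DEEPINFRA_TOKEN" string.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)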


response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[
        {
            "role": "user",
            "content": "Explain the difference between MoE and dense transformer architectures."
        }
    ],
)
print(response.choices[0].message.content)

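import OpenAI from "openai";

// Read the API key from the environment; JavaScript does not expand a literal "$DEEPINFRA_TOKEN" string.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});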


const response = await openai.chat.completions.create({
  model: "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
  messages: [
    {
      role: "user",
      content: "Explain the difference between MoE and dense transformer architectures.",
    },
  ],
});
console.log(response.choices[0].message.content);

The model supports tool/function calling on the same endpoint. If you want to explore the broader set of models available — including other members of the Nemotron family — the DeepInfra models page is a good starting point. To grab your API key and get started, head to the Nemotron 3 Super model page.
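A tool-calling round trip might look roughly like the sketch below. It uses the standard OpenAI chat-completions tool schema, which an OpenAI-compatible endpoint is expected to accept; the `get_weather` tool, the `REGISTRY` mapping, and the `tool_message` helper are hypothetical names introduced for illustration.

```python
import json
from typing import Any, Callable


def get_weather(city: str) -> str:
    """Hypothetical local tool; a real one would query a weather service."""
    return f"Weather lookup for {city} is not implemented in this sketch."


# Tool definition in the OpenAI chat-completions format.
TOOLS: list[dict[str, Any]] = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Maps tool names the model may request to local Python callables.
REGISTRY: dict[str, Callable[..., str]] = {"get_weather": get_weather}


def tool_message(tool_call_id: str, name: str, arguments_json: str) -> dict[str, str]:
    """Run the requested tool and wrap its result as a `tool` role message."""
    result = REGISTRY[name](**json.loads(arguments_json))
    return {"role": "tool", "tool_call_id": tool_call_id, "content": result}


# With a configured client, the loop would be roughly (not executed here):
#   response = client.chat.completions.create(model=..., messages=messages, tools=TOOLS)
#   reply = response.choices[0].message
#   messages.append(reply)
#   for call in reply.tool_calls or []:
#       messages.append(tool_message(call.id, call.function.name, call.function.arguments))
print(tool_message("call_1", "get_weather", '{"city": "Paris"}'))
```

After appending the tool messages, a second `chat.completions.create` call with the extended `messages` list lets the model produce its final answer from the tool results.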

Conclusion

Nemotron 3 Super is a 120B-parameter model that runs at 12B active cost, holds together at 1M context lengths where GPT-OSS-120B does not, and ships with agentic scaffolding hooks that have historically required separate tooling to wire up. That combination of long-context reliability, configurable reasoning, and native tool use makes it a practical foundation for multi-agent pipelines, complex document workflows, and inference-time compute budgeting at scale.

If you’re building systems where context depth and per-token cost need to coexist without compromise, it’s worth evaluating. The Nemotron 3 Super release post has additional background on design decisions and intended use cases. To get started, visit the model page on DeepInfra.