

NVIDIA Nemotron 3 Super: Model Overview & Integration Guide
Published on 2026.05.25 by DeepInfra

The NVIDIA Nemotron 3 Super is a 120-billion-parameter hybrid Mixture-of-Experts (MoE) model designed to combine compute efficiency with high accuracy. It targets multi-agent applications, specialized agentic systems, and complex reasoning tasks. Because its architecture activates only 12 billion parameters per token, the model offers the capability of a much larger LLM at an inference cost suited to real-time, collaborative AI workflows.

Technical Architecture and Capabilities

NVIDIA has optimized the Nemotron 3 Super (specifically the 120B-A12B variant) to run numerous collaborating agents simultaneously on a single GPU. This is achieved through a Latent Mixture-of-Experts (LatentMoE) framework, which projects tokens into a smaller latent dimension for routing, significantly reducing compute overhead.
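To make the routing idea concrete, here is a toy sketch of top-k expert routing computed on a down-projected token. It is not NVIDIA's implementation: the function names, the random weights, and every dimension (`D_MODEL`, `LATENT_DIM`, `NUM_EXPERTS`, `TOP_K`) are assumptions chosen only to illustrate why scoring experts in a smaller latent space is cheaper.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def route_in_latent_space(token, down_proj, router, top_k):
    """Pick top_k experts using scores computed on a down-projected token.

    token:     hidden vector of size d_model
    down_proj: latent_dim x d_model matrix (hypothetical weights)
    router:    num_experts x latent_dim matrix (hypothetical weights)
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    latent = matvec(down_proj, token)   # d_model -> latent_dim
    scores = matvec(router, latent)     # one score per expert
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the selected experts only (subtract the max for stability).
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Toy sizes, chosen only for illustration.
D_MODEL, LATENT_DIM, NUM_EXPERTS, TOP_K = 16, 4, 8, 2
rng = random.Random(0)
down_proj = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(LATENT_DIM)]
router = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(NUM_EXPERTS)]
token = [rng.gauss(0, 1) for _ in range(D_MODEL)]

assignments = route_in_latent_space(token, down_proj, router, TOP_K)
print(assignments)
```

In this sketch the router matrix has `NUM_EXPERTS × LATENT_DIM` entries instead of `NUM_EXPERTS × D_MODEL`, so the per-token scoring cost shrinks with the latent dimension.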

Key Technical Features:

  • Hybrid Design: Combines Mamba-2, MoE, and Attention layers with Multi-Token Prediction (MTP) to enable faster text generation and native speculative decoding.
  • Massive Context Window: Supports an expansive context window of up to 1,000,000 (1M) tokens, making it ideal for long-document RAG (Retrieval-Augmented Generation) and multi-turn agentic operations.
  • Agentic Optimization: Native support for JSON outputs and function calling ensures seamless integration with external tools and APIs.
  • Quantization: Pre-trained using NVFP4 quantization, maximizing throughput on modern hardware like NVIDIA H100 and Blackwell systems.
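As an illustration of the function-calling feature listed above, the following sketch builds a request body that offers one tool to the model and parses a tool call from a response. It assumes the endpoint accepts the standard OpenAI `tools` format for this model; `get_weather` is a hypothetical tool, and `sample_call` is a hand-written example of the response shape, not real model output.

```python
import json

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B"

def build_tool_call_payload(user_message):
    """Build a chat-completions request body that offers one tool to the model."""
    # get_weather is a hypothetical tool used only for illustration.
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [weather_tool],
    }

def parse_tool_call(tool_call):
    """Return (function_name, arguments_dict) from one tool_calls entry."""
    function = tool_call["function"]
    # In the OpenAI format, arguments arrive as a JSON-encoded string.
    return function["name"], json.loads(function["arguments"])

payload = build_tool_call_payload("What is the weather in Paris?")

# Hand-written example of the shape a tool call takes in a response.
sample_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
}
name, args = parse_tool_call(sample_call)
print(name, args)
```

Your application, not the model, executes the tool; the parsed name and arguments tell it which local function to run before sending the result back in a follow-up message.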

Performance Benchmarks

The Nemotron 3 Super demonstrates specialized capabilities in agentic workflows, scientific reasoning, and autonomous software engineering. Against the peer models compared below, it leads on HMMT Feb25 and on RULER at 1M context, and trails Qwen3.5-122B-A10B on MMLU-Pro and SWE-Bench.

Standard Benchmark Performance

| Benchmark Category | Benchmark Name | Score | Metric |
|---|---|---|---|
| General Knowledge | MMLU-Pro | 83.73 | Accuracy (%) |
| Reasoning | AIME25 (No Tools) | 90.21 | Accuracy (%) |
| Reasoning | HMMT Feb25 (No Tools) | 93.67 | Accuracy (%) |
| Coding | LiveCodeBench (v5) | 81.19 | Pass@1 (%) |
| Human Preference | Arena-Hard-V2 | 73.88 | Score |

Competitive Comparison (120B Class)

| Benchmark | Nemotron 3 Super | Qwen3.5-122B-A10B | GPT-OSS-120B |
|---|---|---|---|
| MMLU-Pro | 83.73 | 86.70 | 81.00 |
| HMMT Feb25 | 93.67 | 91.40 | 90.00 |
| RULER @ 1M Context | 91.75 | 91.33 | 22.30 |
| SWE-Bench | 60.47 | 66.40 | 41.90 |

Task-Specific Highlights

  • Autonomous Engineering: Achieved a 60.47% resolution rate on SWE-Bench (OpenHands), optimized for software development agents.
  • Long-Context Retrieval: Maintained 91.75% accuracy at a full 1M token context window on the RULER benchmark.
  • Multilingual Fidelity: Scored 86.67% on WMT24++, demonstrating high-fidelity translation across English, French, German, Italian, Japanese, Spanish, and Chinese.

Getting Started with the API

Nemotron 3 Super is available for public deployment via the DeepInfra inference cloud. DeepInfra provides an OpenAI-compatible endpoint, making it easy for developers to integrate the model into existing applications.

Authentication

Access requires an API key obtained from your DeepInfra dashboard. Include this key in the Authorization header of your requests:

Authorization: Bearer YOUR_DEEPINFRA_API_KEY

API Endpoint Details

  • Endpoint URL: https://api.deepinfra.com/v1/openai/chat/completions
  • HTTP Method: POST
  • Content Type: application/json

Implementation Examples

Python Example

import os

import requests

# Read the key from the environment; the fallback is a placeholder, not a working key.
DEEPINFRA_API_KEY = os.getenv("DEEPINFRA_API_KEY", "YOUR_DEEPINFRA_API_KEY")

url = "https://api.deepinfra.com/v1/openai/chat/completions"
headers = {
    "Authorization": f"Bearer {DEEPINFRA_API_KEY}",
    "Content-Type": "application/json"
}
payload = {
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain the concept of quantum entanglement in simple terms."}
    ],
    # The OpenAI-compatible chat endpoint uses "max_tokens" to cap generated tokens.
    "max_tokens": 150,
    "temperature": 0.7
}

response = requests.post(url, headers=headers, json=payload, timeout=60)
response.raise_for_status()  # raise on HTTP 4xx/5xx instead of printing an error body
print(response.json())
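The request above prints the whole JSON body. As a rough sketch, the reply text and token counts can be pulled out as shown here, assuming the response follows the standard OpenAI chat-completions shape (`choices[0].message.content` and a `usage` object). `extract_reply` is a hypothetical helper, and `sample_response` is hand-written for illustration, not real model output.

```python
def extract_reply(response_body):
    """Return (assistant_text, usage_dict) from a chat-completions response body."""
    choices = response_body.get("choices") or []
    if not choices:
        # Error bodies typically carry no choices; surface the body for debugging.
        raise ValueError(f"No choices in response: {response_body}")
    text = choices[0]["message"]["content"]
    usage = response_body.get("usage", {})
    return text, usage

# Hand-written sample in the OpenAI response shape, not real model output.
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}

text, usage = extract_reply(sample_response)
print(text)
print(usage["total_tokens"])
```

With a live request, pass `response.json()` to `extract_reply` in place of the sample dictionary.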

cURL Example

curl -X POST \
  https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Authorization: Bearer YOUR_DEEPINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    "messages": [
        {"role": "user", "content": "Explain quantum entanglement."}
    ],
    "max_tokens": 150
  }'

Pricing and Usage

DeepInfra offers a highly competitive, usage-based pricing model for Nemotron 3 Super, allowing developers to scale from prototyping to enterprise production without massive upfront costs.

  • Input Token Pricing: $0.10 per 1 million tokens.
  • Output Token Pricing: $0.50 per 1 million tokens.
  • Usage-Based: You only pay for the tokens processed.
  • Tier-Based Pricing: For high-volume enterprise needs, refer to the official DeepInfra pricing page for the most up-to-date information on rate limits and tiers.
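The per-token rates listed above translate into a simple cost estimate. The sketch below uses those two rates only; `estimate_cost_usd` is a hypothetical helper, and the token counts are made up for illustration. Actual billing may differ, so treat the result as an approximation.

```python
INPUT_USD_PER_MILLION = 0.10   # input rate from the pricing list above
OUTPUT_USD_PER_MILLION = 0.50  # output rate from the pricing list above

def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate request cost in USD from token counts at the listed rates."""
    return (input_tokens * INPUT_USD_PER_MILLION
            + output_tokens * OUTPUT_USD_PER_MILLION) / 1_000_000

# Example: a long-context request with 200,000 input and 2,000 output tokens.
cost = estimate_cost_usd(200_000, 2_000)
print(f"${cost:.4f}")  # prints $0.0210
```

The `prompt_tokens` and `completion_tokens` fields of an OpenAI-style `usage` object can be fed straight into this function to track spend per request.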

Hardware Requirements for Local Deployment

For users looking to deploy the model on private infrastructure, the following hardware configurations are recommended:

  • Minimum: 8× NVIDIA H100-80GB GPUs.
  • Optimized: Fully compatible with NVIDIA Grace Blackwell (GB200) systems. On B200/B300 hardware, the BF16 checkpoint can fit on as few as 2 GPUs due to increased HBM capacity.
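A back-of-the-envelope calculation shows where these GPU counts come from. The sketch below counts weight memory only and ignores KV cache, activations, and framework overhead, which is one reason real deployments need more headroom than the weights alone suggest. The 180 GB per-GPU figure for B200 is an assumption; check the hardware specification for your system.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

def fits(num_gpus, gpu_memory_gb, required_gb):
    """True if the pooled GPU memory covers the requirement."""
    return num_gpus * gpu_memory_gb >= required_gb

TOTAL_PARAMS = 120e9                         # 120B total parameters
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2)  # BF16 stores 2 bytes per parameter

H100_GB = 80    # per-GPU memory of the H100-80GB named above
B200_GB = 180   # assumed usable HBM per B200; verify against the spec sheet

print(bf16_gb)                        # 240.0
print(fits(8, H100_GB, bf16_gb))      # True: 640 GB pooled
print(fits(2, H100_GB, bf16_gb))      # False: 160 GB pooled
print(fits(2, B200_GB, bf16_gb))      # True: 360 GB pooled
```

The gap between 240 GB of weights and the 640 GB pooled on 8× H100 is what remains for the KV cache, which grows with context length and the number of concurrent requests.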

Conclusion

The NVIDIA Nemotron 3 Super represents a significant milestone in Mixture-of-Experts technology. By combining a massive 120B parameter knowledge base with a highly efficient 12B active parameter execution, it offers a unique value proposition: enterprise-grade reasoning and multi-agent collaboration at a fraction of the traditional compute cost. Whether you are building autonomous software agents or processing million-token documents, Nemotron 3 Super provides the accuracy and efficiency required for modern AI systems.

For the latest updates and community milestones, visit the official NVIDIA news section or the DeepInfra blog.
