Nemotron 3 Nano Omni — the first multimodal model in the Nemotron 3 family, now on DeepInfra!

The Nemotron family spans Omni, Nano, Super, and specialized instruct variants, enabling you to balance accuracy, reasoning depth, latency, and cost for your specific workload:

- Omni for multimodal reasoning across text, images, audio, and video
- Nano for maximum efficiency and stable inference
- Super for multi-agent systems and advanced reasoning
- Instruct variants for instruction-following and conversational workloads
Nemotron 3 Nano Omni is an open multimodal model built on a hybrid Mixture-of-Experts (MoE) architecture, engineered for high efficiency and strong accuracy across image, video, audio, and text inputs. It powers always-on sub-agents for computer use, document intelligence, and audio-video understanding—replacing fragmented vision, speech, and language pipelines with a single unified inference pass.
- Price per 1M input tokens: $0.20
- Price per 1M output tokens: $0.80
- Release Date: 04/28/2026
- Context Size: 262,144 tokens
- Quantization: bfloat16
```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```
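Since Omni accepts image, audio, and video inputs alongside text, a multimodal request can be sketched as below. This assumes the DeepInfra endpoint follows the OpenAI-style content-parts format (`"type": "text"` / `"type": "image_url"` entries in the message content); the example URL is a placeholder, and you should confirm the accepted part types against the deployment docs.

```python
# Sketch: a multimodal message for Nemotron 3 Nano Omni, assuming
# OpenAI-style content parts are accepted by the endpoint.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/photo.jpg"},
            },
        ],
    }
]

# The payload would then be passed to chat.completions.create exactly as
# in the text-only snippet above, with the same model name.

part_types = [part["type"] for part in messages[0]["content"]]
print(part_types)  # ['text', 'image_url']
```

The same content-parts shape would extend to audio or video parts if the deployment exposes them; only the part `type` and its payload key change.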
NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for the highest compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.
- Price per 1M input tokens: $0.10
- Price per 1M output tokens: $0.50
- Release Date: 03/10/2026
- Context Size: 262,144 tokens
- Quantization: bfloat16
```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```
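The snippets above use the literal placeholder `"$DEEPINFRA_TOKEN"` as the API key; in practice you would read the token from the environment so it never lands in source control. A minimal sketch — the `DEEPINFRA_TOKEN` variable name is a common convention here, not a requirement of the API:

```python
import os

# Read the DeepInfra API token from the environment rather than
# hard-coding it in source.
token = os.environ.get("DEEPINFRA_TOKEN")
if token is None:
    print("DEEPINFRA_TOKEN is not set; export it before creating the client.")
else:
    print("token loaded from environment")

# The client is then constructed exactly as in the snippets above:
# openai = OpenAI(api_key=token, base_url="https://api.deepinfra.com/v1/openai")
```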
NVIDIA Nemotron 3 Nano is an open small reasoning model optimized for fast, cost-efficient inference in agentic and production workloads. Built with a hybrid Mixture-of-Experts (MoE) and Mamba-Transformer architecture, it delivers strong multi-step reasoning, high token throughput, stable latency with predictable cost, and efficient deployment for agent-based systems. Designed for real-world AI systems where reasoning can generate significantly more tokens per prompt, Nemotron Nano reduces compute cost while maintaining strong reasoning quality.
- Price per 1M input tokens: $0.05
- Price per 1M output tokens: $0.20
- Release Date: 12/15/2025
- Context Size: 262,144 tokens
- Quantization: fp4
```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your DeepInfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-30B-A3B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```
The Nemotron family spans Nano, Super, and specialized instruct variants, enabling you to balance accuracy, reasoning depth, latency, and cost for your specific workload.
| Model | Context | $ per 1M input tokens | $ per 1M output tokens |
|---|---|---|---|
| NVIDIA-Nemotron-3-Super-120B-A12B | 256k | $0.10 | $0.50 |
| Nemotron-3-Nano-30B-A3B | 256k | $0.05 | $0.20 |
| Llama-3.3-Nemotron-Super-49B-v1.5 | 128k | $0.10 | $0.40 |
| NVIDIA-Nemotron-Nano-9B-v2 | 128k | $0.04 | $0.16 |
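As a quick sanity check on the pricing above: a request costs `input_tokens / 1M × input_price + output_tokens / 1M × output_price`. A small sketch using two of the prices from the table (the `request_cost` helper is ours, purely for illustration):

```python
# Per-request cost computed from the per-1M-token prices in the table above.
PRICES = {  # model -> (USD per 1M input tokens, USD per 1M output tokens)
    "NVIDIA-Nemotron-3-Super-120B-A12B": (0.10, 0.50),
    "Nemotron-3-Nano-30B-A3B": (0.05, 0.20),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request to `model`."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price

# Example: a 2,000-token prompt with a 500-token completion.
super_cost = request_cost("NVIDIA-Nemotron-3-Super-120B-A12B", 2000, 500)
nano_cost = request_cost("Nemotron-3-Nano-30B-A3B", 2000, 500)
print(f"Super: ${super_cost:.6f}, Nano: ${nano_cost:.6f}")
# Super: $0.000450, Nano: $0.000200
```

The actual token counts for billing come back in the `usage` field of each response, as shown in the snippets above.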
© 2026 Deep Infra. All rights reserved.