
Nemotron 3 Nano Omni — the first multimodal model in the Nemotron 3 family, now on DeepInfra!


Nemotron Model Family

The Nemotron family spans Omni, Nano, Super, and specialized instruct variants, enabling you to balance accuracy, reasoning depth, latency, and cost for your specific workload.

Omni for multimodal reasoning across text, audio, and video

Nano for maximum efficiency and stable inference

Super for multi-agent systems and advanced reasoning

Instruct variants for instruction-following and conversational workloads

Featured Model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning

Nemotron 3 Nano Omni is an open multimodal model built on a hybrid Mixture-of-Experts (MoE) architecture, engineered for high efficiency and strong accuracy across image, video, audio, and text inputs. It powers always-on sub-agents for computer use, document intelligence, and audio-video understanding—replacing fragmented vision, speech, and language pipelines with a single unified inference pass.

Price per 1M input tokens

$0.20


Price per 1M output tokens

$0.80


Release Date

04/28/2026


Context Size

262,144


Quantization

bfloat16


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
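Since Nemotron 3 Nano Omni accepts image, video, and audio inputs, a multimodal request looks slightly different from the text-only example above. The sketch below builds a messages payload using the OpenAI-style `image_url` content parts; this format is an assumption based on the OpenAI-compatible API, and the example URL is a placeholder — check the model's documentation for the exact supported content types.

```python
# Sketch: a multimodal message combining text and an image.
# Assumes the DeepInfra OpenAI-compatible endpoint accepts
# OpenAI-style "image_url" content parts (unverified assumption).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                # Placeholder URL — replace with your own image.
                "image_url": {"url": "https://example.com/photo.jpg"},
            },
        ],
    }
]

# With a client configured as in the snippet above, the call would be:
# chat_completion = openai.chat.completions.create(
#     model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
#     messages=messages,
# )

print(messages[0]["content"][1]["type"])
```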

Featured Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B

NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for maximum compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.

Price per 1M input tokens

$0.10


Price per 1M output tokens

$0.50


Release Date

03/10/2026


Context Size

262,144


Quantization

bfloat16


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25

Featured Model: nvidia/Nemotron-3-Nano-30B-A3B

NVIDIA Nemotron 3 Nano is an open small reasoning model optimized for fast, cost-efficient inference in agentic and production workloads. Built with a hybrid Mixture-of-Experts (MoE) and Mamba-Transformer architecture, it delivers strong multi-step reasoning, high token throughput, stable latency with predictable cost, and efficient deployment for agent-based systems. Designed for real-world AI systems where reasoning can generate significantly more tokens per prompt, Nemotron Nano reduces compute cost while maintaining strong reasoning quality.

Price per 1M input tokens

$0.05


Price per 1M output tokens

$0.20


Release Date

12/15/2025


Context Size

262,144


Quantization

fp4


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-30B-A3B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25

Available Nemotron Models

The Nemotron family spans Nano, Super, and specialized instruct variants, enabling you to balance accuracy, reasoning depth, latency, and cost for your specific workload.

  • Nano for maximum efficiency and stable inference
  • Super for multi-agent systems and advanced reasoning
  • Instruct variants for instruction-following and conversational workloads
Model                               Context   $ per 1M input tokens   $ per 1M output tokens
NVIDIA-Nemotron-3-Super-120B-A12B   256k      $0.10                   $0.50
Nemotron-3-Nano-30B-A3B             256k      $0.05                   $0.20
Llama-3.3-Nemotron-Super-49B-v1.5   128k      $0.10                   $0.40
NVIDIA-Nemotron-Nano-9B-v2          128k      $0.04                   $0.16

FAQ

How do I integrate Nemotron models into my application?

You can integrate Nemotron models seamlessly using DeepInfra’s OpenAI-compatible API. Just replace your existing base URL with DeepInfra’s endpoint and use your DeepInfra API key—no infrastructure setup required. DeepInfra also supports integration through libraries like openai, litellm, and other SDKs, making it easy to switch or scale your workloads instantly.

What are the pricing details for using Nemotron models on DeepInfra?

Pricing is usage-based:
  • Input Tokens: between $0.04 and $0.10 per million
  • Output Tokens: between $0.16 and $0.50 per million
Prices vary slightly by model. There are no upfront fees, and you only pay for what you use.
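To see what usage-based pricing means in practice, here is a small helper that estimates the cost of a single request from its token counts and a model's per-million-token rates. The function name and the worked example are illustrative, using the token counts printed by the "Hello" snippets above and the Nemotron-3-Nano-30B-A3B rates.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Estimate request cost in USD, given rates in $ per 1M tokens."""
    return (prompt_tokens * input_rate
            + completion_tokens * output_rate) / 1_000_000

# Example: the "Hello" request above used 11 input and 25 output tokens.
# On Nemotron-3-Nano-30B-A3B ($0.05 input / $0.20 output per 1M tokens):
cost = estimate_cost(11, 25, 0.05, 0.20)
print(f"${cost:.8f}")  # → $0.00000555
```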

How do I get started using Nemotron on DeepInfra?

  • Sign in with GitHub at deepinfra.com
  • Get your API key
  • Test models directly from the browser, cURL, or SDKs
  • Review pricing on your usage dashboard
Within minutes, you can deploy apps using Nemotron models—without any infrastructure setup.