
Nemotron 3 Nano Omni — the first multimodal model in the Nemotron 3 family, now on DeepInfra!


Nemotron Model Family

The Nemotron family spans Omni, Nano, Super, and specialized instruct variants, enabling you to balance accuracy, reasoning depth, latency, and cost for your specific workload.

Omni for multimodal reasoning across text, audio, and video

Nano for maximum efficiency and stable inference

Super for multi-agent systems and advanced reasoning

Instruct variants for instruction-following and conversational workloads

Featured Model: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning

Nemotron 3 Nano Omni is an open multimodal model built on a hybrid Mixture-of-Experts (MoE) architecture, engineered for high efficiency and strong accuracy across image, video, audio, and text inputs. It powers always-on sub-agents for computer use, document intelligence, and audio-video understanding—replacing fragmented vision, speech, and language pipelines with a single unified inference pass.

Price per 1M input tokens

$0.20


Price per 1M output tokens

$0.80


Release Date

04/28/2026


Context Size

262,144


Quantization

bfloat16


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
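Since Nemotron 3 Nano Omni accepts image, video, and audio inputs, a multimodal request looks slightly different from the text-only example above. The sketch below builds a messages payload using the OpenAI-style `image_url` content parts; this format is an assumption based on the OpenAI-compatible API, and the example URL is a placeholder — check the model's documentation for the exact supported content types.

```python
# Sketch: a multimodal message combining text and an image.
# Assumes the DeepInfra OpenAI-compatible endpoint accepts
# OpenAI-style "image_url" content parts (unverified assumption).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {
                "type": "image_url",
                # Placeholder URL — replace with your own image.
                "image_url": {"url": "https://example.com/photo.jpg"},
            },
        ],
    }
]

# With a client configured as in the snippet above, the call would be:
# chat_completion = openai.chat.completions.create(
#     model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
#     messages=messages,
# )

print(messages[0]["content"][1]["type"])
```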

Featured Model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B

NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for maximum compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.

Price per 1M input tokens

$0.10


Price per 1M output tokens

$0.50


Release Date

03/10/2026


Context Size

262,144


Quantization

bfloat16


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25

Featured Model: nvidia/Nemotron-3-Nano-30B-A3B

NVIDIA Nemotron 3 Nano is an open small reasoning model optimized for fast, cost-efficient inference in agentic and production workloads. Built with a hybrid Mixture-of-Experts (MoE) and Mamba-Transformer architecture, it delivers strong multi-step reasoning, high token throughput, stable latency with predictable cost, and efficient deployment for agent-based systems. Designed for real-world AI systems where reasoning can generate significantly more tokens per prompt, Nemotron Nano reduces compute cost while maintaining strong reasoning quality.

Price per 1M input tokens

$0.05


Price per 1M output tokens

$0.20


Release Date

12/15/2025


Context Size

262,144


Quantization

fp4


# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-30B-A3B",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25

Available Nemotron Models

The Nemotron family spans Nano, Super, and specialized instruct variants, enabling you to balance accuracy, reasoning depth, latency, and cost for your specific workload.

  • Nano for maximum efficiency and stable inference
  • Super for multi-agent systems and advanced reasoning
  • Instruct variants for instruction-following and conversational workloads
Model                               Context   $ per 1M input tokens   $ per 1M output tokens
NVIDIA-Nemotron-3-Super-120B-A12B   256k      $0.10                   $0.50
Nemotron-3-Nano-30B-A3B             256k      $0.05                   $0.20
Llama-3.3-Nemotron-Super-49B-v1.5   128k      $0.10                   $0.40
NVIDIA-Nemotron-Nano-9B-v2          128k      $0.04                   $0.16

FAQ

How do I integrate Nemotron models into my application?

You can integrate Nemotron models seamlessly using DeepInfra’s OpenAI-compatible API. Just replace your existing base URL with DeepInfra’s endpoint and use your DeepInfra API key—no infrastructure setup required. DeepInfra also supports integration through libraries like openai, litellm, and other SDKs, making it easy to switch or scale your workloads instantly.

What are the pricing details for using Nemotron models on DeepInfra?

Pricing is usage-based:
  • Input Tokens: between $0.04 and $0.10 per million
  • Output Tokens: between $0.16 and $0.50 per million
Prices vary slightly by model. There are no upfront fees, and you only pay for what you use.
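To see what usage-based pricing means in practice, here is a small helper that estimates the cost of a single request from its token counts and a model's per-million-token rates. The function name and the worked example are illustrative, using the token counts printed by the "Hello" snippets above and the Nemotron-3-Nano-30B-A3B rates.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_rate: float, output_rate: float) -> float:
    """Estimate request cost in USD, given rates in $ per 1M tokens."""
    return (prompt_tokens * input_rate
            + completion_tokens * output_rate) / 1_000_000

# Example: the "Hello" request above used 11 input and 25 output tokens.
# On Nemotron-3-Nano-30B-A3B ($0.05 input / $0.20 output per 1M tokens):
cost = estimate_cost(11, 25, 0.05, 0.20)
print(f"${cost:.8f}")  # → $0.00000555
```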

How do I get started using Nemotron on DeepInfra?

  • Sign in with GitHub at deepinfra.com
  • Get your API key
  • Test models directly from the browser, cURL, or SDKs
  • Review pricing on your usage dashboard
Within minutes, you can deploy apps using Nemotron models—without any infrastructure setup.