Nemotron 3 Nano Omni — the first multimodal model in the Nemotron 3 family, now on DeepInfra!

We are excited to announce that DeepInfra is an official launch partner for NVIDIA Nemotron™ 3 Nano Omni, the first multimodal model in the Nemotron 3 family.
Nemotron 3 Nano Omni is an open multimodal model that handles everything an agent needs to see and hear — images, video, audio, documents, and text — in a single inference pass. It delivers leading multimodal accuracy and roughly 9x higher throughput than other open omni models at the same level of interactivity, which translates into lower cost and better scalability.
On DeepInfra, the model is available from day one with zero setup, low latency, and no operational overhead. You can build and scale always-on multimodal sub-agents — for computer use, document intelligence, and audio-video understanding — using only a few lines of code.
Most multimodal agents today are built by bolting a vision model next to a speech model next to an LLM. Every extra inference pass adds latency, every cross-model handoff fragments context, and orchestration and error handling multiply over long-running workflows. Nemotron 3 Nano Omni replaces that approach with a single unified model that sees, hears, reads, and reasons across modalities in one loop.
The model combines unified vision and audio encoders with a hybrid Mixture of Experts (MoE) and Mamba-Transformer backbone — the same architectural foundation as Nemotron 3 Nano, extended to natively understand and reason across images, video, audio, and text, with text output.
On top of this architecture, 3D convolution layers and Efficient Video Sampling (EVS) keep video reasoning cheap across long clips, and a hybrid MoE design activates ~3B of 30B parameters per token — giving Nemotron 3 Nano Omni inference economics closer to a small dense model while holding quality closer to a much larger one.
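As a rough, unofficial back-of-envelope (assuming the common approximation of about 2 FLOPs per active parameter per decoded token), the sparse activation is what drives those economics:

# Illustrative arithmetic only: per-token decode compute scales with
# *active* parameters, so a 30B MoE with ~3B active per token pays
# roughly a tenth of the compute of a hypothetical dense 30B model.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9

flops_per_token_moe = 2 * ACTIVE_PARAMS      # ~2 FLOPs per active parameter
flops_per_token_dense = 2 * TOTAL_PARAMS     # a dense model activates everything

print(f"~{flops_per_token_dense / flops_per_token_moe:.0f}x less compute per token")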
Chief among the capabilities these design choices enable is the 256K-token context window, a core part of the model's design. For multimodal agents, this means holding long screen sessions, multi-hour calls, and mixed-media documents in a single reasoning frame — without dropping critical context mid-task.
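To make that figure concrete, here is a minimal budgeting sketch. The per-modality token rates below are illustrative placeholders, not published numbers for this model; check the model documentation for real values:

# Illustrative context budgeting for a long multimodal session.
# All rates are assumed placeholders, not measured figures.
CONTEXT_WINDOW = 256_000

ASSUMED_TOKENS_PER_VIDEO_FRAME = 256   # hypothetical vision-token cost per frame
ASSUMED_TOKENS_PER_AUDIO_SECOND = 25   # hypothetical audio-token rate

screen_frames = 300          # e.g. one frame every 10 s over a 50-minute session
call_audio_seconds = 3_600   # one hour of call audio
text_tokens = 20_000         # prompts, tool results, transcripts

used = (screen_frames * ASSUMED_TOKENS_PER_VIDEO_FRAME
        + call_audio_seconds * ASSUMED_TOKENS_PER_AUDIO_SECOND
        + text_tokens)
print(f"Estimated usage: {used:,} / {CONTEXT_WINDOW:,} tokens "
      f"({used / CONTEXT_WINDOW:.0%} of the window)")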
Nemotron 3 Nano Omni delivers leading scores across a wide range of multimodal benchmarks, including MathVista, Video-MME, OCRv2, CharXiv, ScreenSpot-Pro, MMLongBench-Doc, WorldSense, Daily Omni, MMAU, and VoiceBench. For more details, check out the NVIDIA technical blog.
Like the rest of the Nemotron 3 family, Nemotron 3 Nano Omni is fully open with access to model weights, training datasets, and development recipes. This transparency enables teams to inspect, customize, and fine-tune the model for their domain-specific use cases such as computer-use agents, document intelligence, or multimodal reasoning.
Nemotron 3 Nano Omni is accessible via DeepInfra's OpenAI-compatible API. You can get started in a few lines of code.
Install the client:
pip install openai
Run your first inference (image input):
from openai import OpenAI

# Point the standard OpenAI client at DeepInfra's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="<your-deepinfra-api-key>",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {"role": "system", "content": "You are a helpful perception agent."},
        {
            "role": "user",
            # A single user turn can mix modalities: text plus an image URL.
            "content": [
                {
                    "type": "text",
                    "text": "Describe the UI state in this screenshot and suggest the next action for an automation agent.",
                },
                {"type": "image_url", "image_url": {"url": "https://example.com/screen.png"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)
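If the screenshot lives on disk rather than at a URL, the usual OpenAI-compatible convention is to embed it as a base64 data URL. A minimal sketch reusing the client above (the file path is a placeholder, and data-URL support should be confirmed in the model documentation):

import base64

# Encode a local image as a base64 data URL (placeholder path).
with open("screen.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the UI state in this screenshot."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)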
Stream responses (text output):
stream = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {"role": "user", "content": "Walk me through a multi-step plan to extract structured data from this invoice PDF."}
    ],
    stream=True,
)

# Print tokens as they arrive; delta.content can be None on some chunks.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Streaming applies to response generation (text output). Audio, video, image, and document inputs are supported through the same chat completions endpoint — see the model documentation for the full request schema.
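As one example, audio input can ride the same request shape. The "input_audio" content part below follows the common OpenAI chat-completions convention (base64 data plus a format field); the accepted part types and formats are an assumption here and should be confirmed against the model documentation:

import base64

# Encode a local recording (placeholder path) for the request payload.
with open("call.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this call and list any action items."},
                # Part type per the OpenAI-compatible convention; verify in the docs.
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)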
DeepInfra operates with a zero-retention policy. Inputs, outputs, and user data are not stored. The platform is SOC 2 and ISO 27001 certified, following industry best practices for security and privacy. More information is available in our DeepInfra Trust Center.
Visit the Nemotron 3 Nano Omni model page on DeepInfra to explore pricing and start inference instantly. Check out our documentation to learn more about the broader model ecosystem and developer resources.
Have questions or need help? Reach out at feedback@deepinfra.com, join our Discord, or connect with us on X (@DeepInfra) — we're happy to help.