Nemotron 3 Nano Omni — the first multimodal model in the Nemotron 3 family, now on DeepInfra!

We are excited to announce that DeepInfra is an official launch partner for NVIDIA Nemotron™ 3 Nano Omni, the first multimodal model in the Nemotron 3 family.
Nemotron 3 Nano Omni is an open multimodal model that handles everything an agent needs to see and hear — images, video, audio, documents, and text — in a single inference pass. It delivers leading multimodal accuracy and roughly 9x higher throughput than other open omni models at the same level of interactivity, which translates into lower cost and better scalability.
On DeepInfra, the model is available from day one with zero setup, low latency, and no operational overhead. You can build and scale always-on multimodal sub-agents — for computer use, document intelligence, and audio-video understanding — using only a few lines of code.
Most multimodal agents today are built by bolting a vision model next to a speech model next to an LLM. Every extra inference pass adds latency, every cross-model handoff fragments context, and orchestration and error handling multiply over long-running workflows. Nemotron 3 Nano Omni replaces that approach with a single unified model that sees, hears, reads, and reasons across modalities in one loop.
The model combines unified vision and audio encoders with a hybrid Mixture of Experts (MoE) and Mamba-Transformer backbone — the same architectural foundation as Nemotron 3 Nano, extended to natively understand and reason across images, video, audio, and text, with text output.
On top of this architecture, 3D convolution layers and Efficient Video Sampling (EVS) keep video reasoning cheap across long clips, and a hybrid MoE design activates ~3B of 30B parameters per token — giving Nemotron 3 Nano Omni inference economics closer to a small dense model while holding quality closer to a much larger one.
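As a rough, unofficial back-of-envelope (assuming the common approximation of about 2 FLOPs per active parameter per decoded token), the sparse activation is what drives those economics:

# Illustrative arithmetic only: per-token decode compute scales with
# *active* parameters, so a 30B MoE with ~3B active per token pays
# roughly a tenth of the compute of a hypothetical dense 30B model.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9

flops_per_token_moe = 2 * ACTIVE_PARAMS      # ~2 FLOPs per active parameter
flops_per_token_dense = 2 * TOTAL_PARAMS     # a dense model activates everything

print(f"~{flops_per_token_dense / flops_per_token_moe:.0f}x less compute per token")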
Chief among the capabilities these design choices enable is the 256K-token context window, a core part of the model's design. For multimodal agents, this means holding long screen sessions, multi-hour calls, and mixed-media documents in a single reasoning frame — without dropping critical context mid-task.
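To make that figure concrete, here is a minimal budgeting sketch. The per-modality token rates below are illustrative placeholders, not published numbers for this model; check the model documentation for real values:

# Illustrative context budgeting for a long multimodal session.
# All rates are assumed placeholders, not measured figures.
CONTEXT_WINDOW = 256_000

ASSUMED_TOKENS_PER_VIDEO_FRAME = 256   # hypothetical vision-token cost per frame
ASSUMED_TOKENS_PER_AUDIO_SECOND = 25   # hypothetical audio-token rate

screen_frames = 300          # e.g. one frame every 10 s over a 50-minute session
call_audio_seconds = 3_600   # one hour of call audio
text_tokens = 20_000         # prompts, tool results, transcripts

used = (screen_frames * ASSUMED_TOKENS_PER_VIDEO_FRAME
        + call_audio_seconds * ASSUMED_TOKENS_PER_AUDIO_SECOND
        + text_tokens)
print(f"Estimated usage: {used:,} / {CONTEXT_WINDOW:,} tokens "
      f"({used / CONTEXT_WINDOW:.0%} of the window)")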
Nemotron 3 Nano Omni delivers leading scores across a wide range of multimodal benchmarks, including MathVista, Video-MME, OCRv2, CharXiv, ScreenSpot-Pro, MMLongBench-Doc, WorldSense, Daily Omni, MMAU, and VoiceBench. For more details, check out the NVIDIA technical blog.
Like the rest of the Nemotron 3 family, Nemotron 3 Nano Omni is fully open with access to model weights, training datasets, and development recipes. This transparency enables teams to inspect, customize, and fine-tune the model for their domain-specific use cases such as computer-use agents, document intelligence, or multimodal reasoning.
Nemotron 3 Nano Omni is accessible via DeepInfra's OpenAI-compatible API. You can get started in a few lines of code.
Install the client:
pip install openai
Run your first inference (image input):
from openai import OpenAI

# Point the standard OpenAI client at DeepInfra's OpenAI-compatible endpoint.
client = OpenAI(
    api_key="<your-deepinfra-api-key>",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {"role": "system", "content": "You are a helpful perception agent."},
        {
            "role": "user",
            # A single user turn can mix modalities: text plus an image URL.
            "content": [
                {
                    "type": "text",
                    "text": "Describe the UI state in this screenshot and suggest the next action for an automation agent.",
                },
                {"type": "image_url", "image_url": {"url": "https://example.com/screen.png"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)
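If the screenshot lives on disk rather than at a URL, the usual OpenAI-compatible convention is to embed it as a base64 data URL. A minimal sketch reusing the client above (the file path is a placeholder, and data-URL support should be confirmed in the model documentation):

import base64

# Encode a local image as a base64 data URL (placeholder path).
with open("screen.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the UI state in this screenshot."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)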
Stream responses (text output):
stream = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {"role": "user", "content": "Walk me through a multi-step plan to extract structured data from this invoice PDF."}
    ],
    stream=True,
)

# Print tokens as they arrive; delta.content can be None on some chunks.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
Streaming applies to response generation (text output). Audio, video, image, and document inputs are supported through the same chat completions endpoint — see the model documentation for the full request schema.
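As one example, audio input can ride the same request shape. The "input_audio" content part below follows the common OpenAI chat-completions convention (base64 data plus a format field); the accepted part types and formats are an assumption here and should be confirmed against the model documentation:

import base64

# Encode a local recording (placeholder path) for the request payload.
with open("call.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this call and list any action items."},
                # Part type per the OpenAI-compatible convention; verify in the docs.
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)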
DeepInfra operates with a zero-retention policy. Inputs, outputs, and user data are not stored. The platform is SOC 2 and ISO 27001 certified, following industry best practices for security and privacy. More information is available in our DeepInfra Trust Center.
Visit the Nemotron 3 Nano Omni model page on DeepInfra to explore pricing and start inference instantly. Check out our documentation to learn more about the broader model ecosystem and developer resources.
Have questions or need help? Reach out at feedback@deepinfra.com, join our Discord, or connect with us on X (@DeepInfra) — we're happy to help.