

Accelerating Reasoning Workflows with Nemotron 3 Nano on DeepInfra
Published on 2025.12.15 by Yessen Kanapin

We are excited to announce that DeepInfra is an official launch partner for NVIDIA Nemotron 3 Nano, the newest open reasoning model in the Nemotron family. Our goal is to give developers, researchers, and teams the fastest and simplest path to using Nemotron 3 Nano from day one — whether you are building lightweight agents, real-time analytics pipelines, or production-grade reasoning systems. On DeepInfra, Nano runs with zero setup, low latency, and no operational overhead, enabling you to move from idea to deployment in minutes.

With its balance of speed, accuracy, and predictable cost, 3 Nano is designed for real-world reasoning tasks. When paired with DeepInfra's high-efficiency inference platform and usage-based pricing, you can experiment freely, scale seamlessly, and integrate the model into your production workflows using only a few lines of code.
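
As a concrete starting point, here is a minimal sketch of calling the model through DeepInfra's OpenAI-compatible API. The base URL is DeepInfra's standard OpenAI-compatible endpoint; the model ID used below is an assumption for illustration, so check the model page for the exact identifier.

```python
# Minimal sketch: chat completion against DeepInfra's OpenAI-compatible API.
# The model ID below is an assumption for illustration; confirm the exact
# identifier on the DeepInfra model page before use.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="$DEEPINFRA_TOKEN",  # your DeepInfra API token
)

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano",  # assumed ID; verify on the model page
    messages=[
        {"role": "user", "content": "Plan a 3-step approach to debug a flaky test."}
    ],
)
print(response.choices[0].message.content)
```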

Why Nemotron 3 Nano Is Built for Modern Reasoning Workloads

Nemotron 3 Nano introduces a hybrid architecture that combines Mixture of Experts (MoE) layers with the efficient Mamba state-space design: most layers rely on Mamba-style blocks for high-throughput sequence processing, while a focused subset of expert layers handles heavier reasoning operations (a toy sketch of this interleaving follows the list below). This enables:

  • Stable latency even for complex, multi-step tasks
  • Consistent throughput across large and spiky workloads
  • Improved cost-efficiency for agentic and reasoning-intensive applications
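
To make the interleaving concrete, here is a toy sketch in PyTorch. It is purely illustrative and not the actual Nemotron 3 Nano architecture: `nn.Linear` stands in for a fast Mamba-style sequence layer, and a top-1 router sends each token to one small expert in the occasional MoE layer.

```python
# Toy illustration only: interleave cheap sequence-mixing layers with a
# sparse top-1 MoE layer. This is NOT the real Nemotron 3 Nano architecture;
# it only shows the hybrid pattern described above.
import torch
import torch.nn as nn

class SparseExpertLayer(nn.Module):
    """Route each token to one small expert network (top-1 routing)."""
    def __init__(self, dim: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        choice = self.router(x).argmax(dim=-1)   # (batch, seq): expert index per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i                   # tokens routed to expert i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

# A hybrid stack: mostly fast layers, with one sparse expert layer in four.
dim = 32
stack = nn.ModuleList(
    SparseExpertLayer(dim) if i % 4 == 3 else nn.Linear(dim, dim)  # Linear = stand-in
    for i in range(8)
)
x = torch.randn(2, 16, dim)                      # (batch, seq, dim)
for layer in stack:
    x = layer(x)
print(x.shape)                                   # torch.Size([2, 16, 32])
```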

To strengthen its reasoning capabilities, 3 Nano is trained on NVIDIA-curated synthetic reasoning datasets generated from expert models and aligned using reinforcement-learning methods to encourage more human-like thought patterns. Benchmark results and third-party analysis confirm strong performance across:

  • Mathematics and quantitative reasoning
  • Coding and algorithmic problem-solving
  • Scientific analysis
  • Structured, multi-step decision workflows

Benchmark data shown below is based on independent evaluations by Artificial Analysis and is included for reference.

[Charts omitted: Artificial Analysis Openness Index vs. Intelligence Index; Intelligence vs. Output Speed. Source: Artificial Analysis]

A key design principle of the Nemotron family, including this model, is openness: the weights, training data, and training recipes are available to the community. Teams can inspect, customize, or fine-tune the model to fit research, product, or enterprise needs. This transparency aligns well with DeepInfra's mission to provide a predictable, developer-centric platform for running high-quality open models.

Flexible Deployment, Immediate Access

Nemotron 3 Nano supports a wide range of deployments—local hardware, cloud platforms, or NVIDIA NIM-based setups. On DeepInfra, the model is available through a fully managed endpoint, giving developers immediate access without navigating infrastructure provisioning or configuration.
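
As a taste of that immediacy, the hedged sketch below streams tokens from the managed endpoint through the same OpenAI-compatible API; the model ID is again an assumption to verify on the model page.

```python
# Sketch: stream tokens from DeepInfra's managed endpoint as they are generated.
# Model ID is assumed for illustration; verify the exact name on the model page.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="$DEEPINFRA_TOKEN",
)

stream = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano",  # assumed ID
    messages=[{"role": "user", "content": "Summarize why hybrid MoE models are fast."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```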

Developers can expect:

  • Fast, efficient inference enabled by the hybrid architecture and DeepInfra's low-latency stack
  • Strong reasoning performance thanks to high-quality synthetic training and reinforcement learning alignment
  • Full transparency and adaptability via open weights and reproducible training methods
  • Seamless deployment across local environments, cloud infrastructure, or DeepInfra-managed endpoints

Getting Started with Nemotron 3 Nano on DeepInfra

To explore Nano's capabilities, start with our ready-made Jupyter notebook. It's the fastest way to get going, with working examples you can run immediately.

Quick Start with the Tutorial Notebook

A hands-on guide showing how to run Nano, tune reasoning parameters, use long-context inputs, and build lightweight agentic workflows.

The nemotron-3-nano-tutorial.ipynb notebook walks through:

  • Basic inference with the DeepInfra API
  • Reasoning parameter tuning (temperature, top_p) for different use cases
  • Long-context handling for extended documents

The notebook includes working code snippets you can copy and use immediately.
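
To give a flavor of those snippets, the sketch below tunes sampling parameters and feeds a long document as context. The parameter values are illustrative defaults rather than recommendations from the notebook, and the model ID remains an assumption.

```python
# Sketch: tune sampling parameters and pass a long document as context.
# Values and model ID are illustrative assumptions, not official guidance.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="$DEEPINFRA_TOKEN",
)

long_document = open("report.txt").read()  # e.g. an extended document

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Nano",  # assumed ID
    messages=[
        {"role": "system", "content": "Answer using only the provided document."},
        {"role": "user", "content": f"{long_document}\n\nQ: What are the key risks?"},
    ],
    temperature=0.2,  # lower temperature favors precise, repeatable reasoning
    top_p=0.9,        # nucleus sampling cutoff
    max_tokens=512,
)
print(response.choices[0].message.content)
```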

Enterprise-Grade Security and Privacy

DeepInfra operates with a zero-retention policy. Inputs, outputs, and user data are not stored. The platform is SOC 2 and ISO 27001 certified, following industry best practices for security and privacy. More information is available in our Trust Center.

Start Building

Visit the Nemotron 3 Nano model page on DeepInfra to explore pricing and start inference instantly, or check out our documentation to learn more about the broader model ecosystem and developer resources.

Have questions or need help? Reach out to us at feedback@deepinfra.com, join our Discord, or connect with us on X (@DeepInfra). We're happy to help.
