DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

NVIDIA’s Nemotron 3 Super 120B A12B is an open-weight large language model released on March 11, 2026. It features 120B total parameters with only 12B active per forward pass, delivering exceptional compute efficiency for complex multi-agent applications such as software development and cybersecurity triaging.
The model uses a hybrid Mamba2-Transformer LatentMoE architecture with Multi-Token Prediction (MTP), projecting tokens into a smaller latent dimension for expert routing and computation. This improves accuracy per byte and delivers over 5x throughput compared to the previous Nemotron Super generation. Notably, it is the first model in the Nemotron 3 family pre-trained using NVFP4 quantization — meaning it learned to be accurate within the constraints of 4-bit arithmetic from the first gradient update, not just at inference time.
Nemotron 3 Super supports a native 1 million token context window and responds to queries by first generating a reasoning trace before concluding with a final response, making it purpose-built for long-running autonomous agents and high-volume workloads such as IT ticket automation.
| Specification | Details |
|---|---|
| Architecture | Mamba2-Transformer Hybrid Latent Mixture of Experts (LatentMoE) with Multi-Token Prediction (MTP) |
| Total Parameters | 120 billion |
| Active Parameters | 12 billion (per inference pass) |
| Context Window | Up to 1 million tokens |
| Training Data | 25 trillion tokens |
| Supported Languages | English, French, German, Italian, Japanese, Spanish, Chinese, plus 43 programming languages |
| Pre-training Cutoff | June 2025 |
| Post-training Cutoff | February 2026 |
NVIDIA Nemotron 3 Super 120B is now available across multiple inference providers — but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.
| Provider | Why Notable | Blended ($/1M) | Speed (t/s) | Latency (TTFT) | Context | Tools |
|---|---|---|---|---|---|---|
| DeepInfra | Best price + strong speed/latency balance; supports function calling | $0.20 | 459.3 | 1.01s | 262k | Yes |
| Baseten | Lowest latency (best TTFT) with near-top speed | $0.41 | 479.9 | 0.56s | 203k | No |
| Lightning AI | Fastest output speed (max throughput) | $0.45 | 498.6 | 1.46s | 256k | No |
| Nebius | High speed; supports JSON mode + function calling | $0.45 | 483.7 | 1.62s | 256k | Yes |
| Weights & Biases | Low latency; supports JSON mode + function calling; low throughput | $0.35 | 144.9 | 0.73s | 262k | Yes |
Based on benchmarks across 5 tracked providers, DeepInfra is the recommended API for production-scale Nemotron 3 Super deployment. At $0.20/1M tokens, it is 55% cheaper than the most expensive providers while delivering 459.3 t/s — within 8% of the fastest option. For the lowest latency, Baseten leads at 0.56s TTFT. For maximum raw throughput, Lightning AI leads at 498.6 t/s.
DeepInfra secures the top spot by dominating the economic efficiency of serving Nemotron 3 Super 120B, while maintaining highly competitive performance across every other metric.
The cost delta — $0.25 per million tokens saved compared to the market mean — makes DeepInfra the only logical choice for production-scale deployments. For most RAG or chat applications, the difference between 498 t/s (Lightning AI) and 459 t/s (DeepInfra) is imperceptible, while the 55% cost advantage compounds significantly at volume.
For applications requiring immediate feedback — such as voice-to-voice agents or highly responsive chat interfaces — Baseten is the technical leader.
Baseten’s 0.56s TTFT beats the closest competitor by 0.17s and delivers a genuinely real-time feel for end-users. However, its pricing ($0.41/1M) is more than double DeepInfra’s, and it lacks support for JSON Mode and Function Calling — limiting its viability for complex agentic workflows.
Lightning AI is purpose-built for generation speed, making it the natural choice for high-volume batch processing jobs.
Lightning AI’s 498.6 t/s is the fastest measured, but the 8.5% speed advantage over DeepInfra does not justify the 125% price premium for most use cases. Combined with the lack of JSON Mode and Function Calling, it is best reserved for offline batch workloads where cost is not a constraint.
Nebius occupies a specific niche for developers requiring both Function Calling and JSON Mode — the only provider besides Weights & Biases to support both.
Nebius is worth the premium ($0.45/1M) only if your application strictly requires both JSON Mode and Function Calling. It delivers solid throughput (483.7 t/s) but suffers from the highest latency in the benchmark (1.62s), making it unsuitable for real-time interfaces.
Weights & Biases presents an unusual performance profile, likely acting as a specialized developer-environment endpoint rather than a production inference backend.
Despite strong latency and full feature support, its throughput bottleneck (144.9 t/s) makes it unsuitable for production traffic. It is best suited for short-context developer testing and evaluation environments.
DeepInfra is the cheapest provider at $0.20 blended per 1M tokens — roughly 55% cheaper than Nebius and Lightning AI, and 50% cheaper than Baseten.
Baseten offers the fastest latency with a TTFT of 0.56s, making it ideal for real-time conversational applications.
Yes, DeepInfra supports Function Calling, making it suitable for agentic workflows. Lightning AI and Baseten currently do not support this feature.
Nebius is worth the premium ($0.45/1M) only if your application strictly requires both JSON Mode and Function Calling with no tolerance for prompt-engineering workarounds.
Nemotron 3 Super uses a unique hybrid Mamba2-Transformer LatentMoE architecture, enabling 120B total parameters with only 12B active per inference. This delivers over 5x throughput compared to the previous Nemotron Super, while supporting a native 1M-token context window for long-running autonomous agents.
For the vast majority of Nemotron 3 Super 120B deployments, DeepInfra is the recommended provider. It offers the market’s lowest price ($0.20/1M), strong throughput (459.3 t/s), viable latency (1.01s), and Function Calling support — all without the significant cost premium of the competition.
Enhancing Open-Source LLMs with Function Calling FeatureWe're excited to announce that the Function Calling feature is now available on DeepInfra. We're offering Mistral-7B and Mixtral-8x7B models with this feature. Other models will be available soon.
LLM models are powerful tools for various tasks. However, they're limited in their ability to per...
Pricing 101: Token Math & Cost-Per-Completion Explained<p>LLM pricing can feel opaque until you translate it into a few simple numbers: input tokens, output tokens, and price per million. Every request you send—system prompt, chat history, RAG context, tool-call JSON—counts as input; everything the model writes back counts as output. Once you know those two counts, the cost of a completion is […]</p>
From Precision to Quantization: A Practical Guide to Faster, Cheaper LLMs<p>Large language models live and die by numbers—literally trillions of them. How finely we store those numbers (their precision) determines how much memory a model needs, how fast it runs, and sometimes how good its answers are. This article walks from the basics to the deep end: we’ll start with how computers even store a […]</p>
© 2026 DeepInfra. All rights reserved.