
Nemotron 3 Super 120B is available across a growing number of hosted APIs and deployment platforms. The model has 120B total parameters with 12B active per inference pass, and latency, throughput, and cost vary significantly depending on where you run it, so the choice of provider matters. This guide covers the top options by use case, from fully managed APIs to dedicated GPU deployments and no-code routing layers. For a detailed cost breakdown, see the Nemotron 3 Super pricing guide.

| Best For | Provider |
|---|---|
| Best overall value & cost | DeepInfra |
| Best for interactive applications | CoreWeave |
| Best for latency-critical & voice agents | Baseten |
| Best for high-volume batch processing | Lightning AI |
| Best for complex agentic workflows | Nebius |
| Best for AWS enterprise integration | Amazon Bedrock |
| Best for flexible deployment options | Qubrid AI |
| Best for asynchronous workloads | Doubleword |
| Best for high availability with routing fallback | OpenRouter |

DeepInfra
DeepInfra is the recommended option for most production Nemotron 3 Super deployments. It delivers the lowest blended price in the benchmarked set at $0.20 per 1M tokens, with strong output speed (459.3 t/s), competitive TTFT (1.01s), and full support for function calling. The platform runs on bare-metal infrastructure, is typically 50–80% cheaper than major cloud alternatives, and is SOC 2 and ISO 27001 certified. Public and private endpoint deployment are both available.
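To make the quoted figures concrete, here is a minimal back-of-envelope sketch that turns TTFT, output speed, and blended price into rough per-request latency and workload cost. It assumes a simplified model in which total response time is TTFT plus output tokens divided by output speed, and that the blended price applies uniformly to all tokens. The function names and the example workload sizes (500 output tokens, 10M total tokens) are hypothetical and chosen only for illustration.

```python
def estimate_response_seconds(output_tokens: int, ttft_s: float, tokens_per_s: float) -> float:
    """Rough end-to-end time: time to first token plus streaming time.

    Simplifying assumption: output speed is constant once generation starts.
    """
    return ttft_s + output_tokens / tokens_per_s


def estimate_cost_usd(total_tokens: int, blended_usd_per_million: float) -> float:
    """Rough cost, assuming the blended price applies to every token."""
    return total_tokens / 1_000_000 * blended_usd_per_million


# Figures quoted in the paragraph above for DeepInfra.
TTFT_S = 1.01
OUTPUT_TOKENS_PER_S = 459.3
BLENDED_USD_PER_M = 0.20

if __name__ == "__main__":
    # Hypothetical workload: a 500-token reply and 10M tokens of total traffic.
    seconds = estimate_response_seconds(500, TTFT_S, OUTPUT_TOKENS_PER_S)
    cost = estimate_cost_usd(10_000_000, BLENDED_USD_PER_M)
    print(f"500-token reply: ~{seconds:.2f} s")   # ~2.10 s
    print(f"10M tokens: ~${cost:.2f}")            # ~$2.00
```

Real latency depends on prompt length, load, and region, so treat these numbers as a planning estimate rather than a measurement.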
For a full breakdown of workload cost scenarios on DeepInfra, see the Nemotron 3 Super pricing guide.
CoreWeave
CoreWeave is highlighted in Artificial Analysis benchmarks for offering competitive sub-second TTFT and low blended pricing. It is a strong fit for real-time inference and cost-sensitive workloads where rapid first response matters.
Baseten
Baseten is purpose-built for latency-critical applications. Its 0.56s TTFT is the fastest measured across benchmarked providers — a meaningful advantage for voice-to-voice agents or any interface where perceived responsiveness depends on getting a first response quickly.
Lightning AI
Lightning AI leads the benchmarked set on raw output speed at 509.3 t/s — the right choice when sustained generation throughput is the primary constraint, such as high-volume batch processing or document generation pipelines.
Nebius
Nebius provides full support for both JSON mode and function calling at high output speeds, making it a solid fit for developers building structured, multi-step agentic workflows that require reliable tool orchestration.
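As an illustration of what reliable tool orchestration involves on the client side, the sketch below defines one tool and dispatches a model-issued tool call to it. It assumes the widely used OpenAI-style `tools` schema, which the text above does not confirm for any specific provider. The `get_weather` tool, the registry, and `dispatch_tool_call` are hypothetical names for this example, and no network request is made.

```python
import json

# Tool schema in the OpenAI-style "tools" format (an assumption; check your
# provider's documentation for the exact request shape it accepts).
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def get_weather(city: str) -> dict:
    """Hypothetical local tool; returns canned data instead of calling a service."""
    return {"city": city, "temperature_c": 21}


TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run a tool call and return a JSON string to send back to the model.

    Errors are returned as JSON rather than raised, so the agent loop can
    pass them back to the model and let it retry.
    """
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"invalid JSON arguments: {exc.msg}"})
    if not isinstance(args, dict):
        return json.dumps({"error": "arguments must be a JSON object"})
    try:
        result = TOOL_REGISTRY[name](**args)
    except TypeError as exc:  # missing or unexpected arguments
        return json.dumps({"error": str(exc)})
    return json.dumps(result)


if __name__ == "__main__":
    print(dispatch_tool_call("get_weather", '{"city": "Oslo"}'))
```

Returning errors as data instead of raising keeps a multi-step workflow alive when the model emits malformed arguments, which is the failure mode that JSON mode and function calling support are meant to reduce.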
Amazon Bedrock
Amazon Bedrock added Nemotron 3 Super on March 18, 2026, providing fully managed access through a single AWS API — no infrastructure to provision. It is the natural choice for enterprise teams already operating within the AWS ecosystem who need compliance, cross-region routing, and flexible service tiers.
Qubrid AI
Qubrid AI offers a range of deployment options from simple serverless API access to dedicated GPU VMs and Kubernetes deployments, bridging the gap between managed inference and custom infrastructure.
Doubleword
Doubleword focuses on workload flexibility with distinct pricing tiers and a batch processing API for asynchronous inference — useful for teams that want to optimize cost by decoupling generation from real-time latency requirements.
OpenRouter
OpenRouter is a unified API routing layer that provides access to Nemotron 3 Super through automatic provider routing and fallback mechanisms. It also offers a free variant (nvidia/nemotron-3-super-120b-a12b:free) with a 1M context window, useful for non-production testing. Current pricing on the paid tier: $0.10/1M input, $0.50/1M output.
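Separate input and output prices are easier to compare with other providers once they are reduced to a single blended figure or to a per-request cost. The sketch below does both for the paid-tier prices quoted above. The 3:1 input-to-output weighting is an assumption made for this example, not something the text specifies, and the function names and the sample request size are hypothetical.

```python
def blended_price(input_usd_per_m: float, output_usd_per_m: float,
                  input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted average price per 1M tokens.

    The default 3:1 input:output weighting is an assumption; use the ratio
    that matches your own traffic.
    """
    total_parts = input_parts + output_parts
    return (input_usd_per_m * input_parts + output_usd_per_m * output_parts) / total_parts


def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_usd_per_m: float, output_usd_per_m: float) -> float:
    """Cost of a single request given per-1M-token prices."""
    return (input_tokens * input_usd_per_m + output_tokens * output_usd_per_m) / 1_000_000


# Paid-tier prices quoted in the paragraph above.
INPUT_USD_PER_M = 0.10
OUTPUT_USD_PER_M = 0.50

if __name__ == "__main__":
    print(f"blended (3:1): ${blended_price(INPUT_USD_PER_M, OUTPUT_USD_PER_M):.2f} per 1M tokens")
    # Hypothetical request: 8,000 prompt tokens and 1,000 completion tokens.
    cost = request_cost_usd(8_000, 1_000, INPUT_USD_PER_M, OUTPUT_USD_PER_M)
    print(f"one request: ${cost:.4f}")
```

Output-heavy workloads shift the blended figure toward the output price, so it is worth recomputing with your own ratio before comparing providers.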
Provider choice for Nemotron 3 Super depends on what your workload actually optimizes for, and the table at the top of this guide maps each use case to a provider.
For most production-scale deployments, DeepInfra is the strongest starting point: lowest blended price, full API feature support, and the infrastructure reliability that comes with bare-metal deployment. The API benchmarks for Nemotron 3 Super and the Nemotron 3 Nano explainer are useful companion reads when evaluating the full Nemotron family.