We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Best API Providers for NVIDIA Nemotron 3 Super 120B
Published on 2026.05.25 by DeepInfra
Best API Providers for NVIDIA Nemotron 3 Super 120B

Nemotron 3 Super 120B is available across a growing number of hosted APIs and deployment platforms. At 120B total parameters with 12B active per inference pass, the right provider matters: latency, throughput, and cost vary significantly depending on where you run it. This guide covers the top options by use case — from fully managed APIs to dedicated GPU deployments and no-code routing layers. For a detailed cost breakdown, see the Nemotron 3 Super pricing guide.

Summary of Top Providers by Use Case

Best ForProvider
Best overall value & costDeepInfra
Best for interactive applicationsCoreWeave
Best for latency-critical & voice agentsBaseten
Best for high-volume batch processingLightning AI
Best for complex agentic workflowsNebius
Best for AWS enterprise integrationAmazon Bedrock
Best for flexible deployment optionsQubrid AI
Best for asynchronous workloadsDoubleword
Best for high availability with routing fallbackOpenRouter

Detailed Provider Reviews

DeepInfra

DeepInfra is the recommended option for most production Nemotron 3 Super deployments. It delivers the lowest blended price in the benchmarked set at $0.20 per 1M tokens, with strong output speed (459.3 t/s), competitive TTFT (1.01s), and full support for function calling. The platform runs on bare-metal infrastructure, is typically 50–80% cheaper than major cloud alternatives, and is SOC 2 and ISO 27001 certified. Public and private endpoint deployment are both available.

Key features:

  • Lowest blended price at $0.20/1M tokens; $0.10/1M input, $0.50/1M output
  • 459.3 t/s output speed
  • 1.01s TTFT
  • Function calling and JSON mode supported
  • 262k context window
  • Public and private endpoints; SOC 2 and ISO 27001 certified

For a full breakdown of workload cost scenarios on DeepInfra, see the Nemotron 3 Super pricing guide.

CoreWeave

CoreWeave is highlighted in Artificial Analysis benchmarks for offering competitive sub-second TTFT and low blended pricing. It is a strong fit for real-time inference and cost-sensitive workloads where rapid first response matters.

Key features:

  • $0.26/1M tokens blended price
  • 0.98s TTFT (fastest in the Artificial Analysis benchmark set)
  • 154.4 t/s output speed
  • Function calling and JSON mode supported
  • 262k context window

Baseten

Baseten is purpose-built for latency-critical applications. Its 0.56s TTFT is the fastest measured across benchmarked providers — a meaningful advantage for voice-to-voice agents or any interface where perceived responsiveness depends on getting a first response quickly.

Key features:

  • 0.56s TTFT (fastest across all benchmarked providers)
  • 479.9 t/s output speed
  • $0.41/1M tokens blended price
  • 203k context window

Lightning AI

Lightning AI leads the benchmarked set on raw output speed at 509.3 t/s — the right choice when sustained generation throughput is the primary constraint, such as high-volume batch processing or document generation pipelines.

Key features:

  • 509.3 t/s output speed (fastest in the set)
  • JSON mode supported
  • 256k context window
  • $0.39–0.45/1M tokens blended price depending on benchmark source

Nebius

Nebius provides full support for both JSON mode and function calling at high output speeds, making it a solid fit for developers building structured, multi-step agentic workflows that require reliable tool orchestration.

Key features:

  • JSON mode and function calling both supported
  • Up to 483.7 t/s output speed
  • 256k context window
  • $0.36–0.45/1M tokens blended price

Amazon Bedrock

Amazon Bedrock added Nemotron 3 Super on March 18, 2026, providing fully managed access through a single AWS API — no infrastructure to provision. It is the natural choice for enterprise teams already operating within the AWS ecosystem who need compliance, cross-region routing, and flexible service tiers.

Key features:

  • Access via bedrock-runtime and bedrock-mantle endpoints
  • Client-side and server-side tool calling supported
  • Standard, Priority, Flex, and Reserved service tiers
  • Cross-region routing (Geo and Global Cross-Region)
  • 256k context window, up to 32k output tokens

Qubrid AI

Qubrid AI offers a range of deployment options from simple serverless API access to dedicated GPU VMs and Kubernetes deployments, bridging the gap between managed inference and custom infrastructure.

Key features:

  • Serverless API at $0.10/1M input, $0.50/1M output tokens
  • Dedicated cloud GPU VMs from $1.25/GPU/hr
  • Official Docker images for containerized deployments
  • Production-grade Kubernetes manifests and Helm charts
  • SDKs for Python, JavaScript, Go, and Java

Doubleword

Doubleword focuses on workload flexibility with distinct pricing tiers and a batch processing API for asynchronous inference — useful for teams that want to optimize cost by decoupling generation from real-time latency requirements.

Key features:

  • Standard, Async, and Realtime pricing tiers
  • Batch processing API for asynchronous workloads
  • OpenAI-compatible endpoints
  • 256k context window

OpenRouter

OpenRouter is a unified API routing layer that provides access to Nemotron 3 Super through automatic provider routing and fallback mechanisms. It also offers a free variant (nvidia/nemotron-3-super-120b-a12b:free) with a 1M context window, useful for non-production testing. Current pricing on the paid tier: $0.10/1M input, $0.50/1M output.

Key features:

  • Unified OpenAI-compatible API with automatic provider routing
  • Fallback mechanisms to maximize uptime
  • $0.10/1M input, $0.50/1M output on paid tier
  • Free variant available with 1M token context window
  • 1M token context window (paid tier)

Conclusion

Provider choice for Nemotron 3 Super depends on what your workload actually optimizes for:

  • Production deployments at scale: DeepInfra — lowest blended cost, full function calling, private endpoints
  • Interactive and latency-critical apps: Baseten (0.56s TTFT) or CoreWeave (0.98s TTFT, lowest blended in AA benchmark set)
  • High-volume batch processing: Lightning AI — 509.3 t/s output speed
  • Complex agentic workflows needing JSON + function calling: Nebius
  • AWS enterprise integration: Amazon Bedrock — fully managed, compliant, cross-region
  • Flexible self-hosted or dedicated GPU: Qubrid AI
  • Async batch workloads: Doubleword
  • High availability with routing fallback: OpenRouter

For most production-scale deployments, DeepInfra is the strongest starting point: lowest blended price, full API feature support, and the infrastructure reliability that comes with bare-metal deployment. The API benchmarks for Nemotron 3 Super and the Nemotron 3 Nano explainer are useful companion reads when evaluating the full Nemotron family.

Related articles
Qwen3.5 122B A10B API Benchmarks: Latency, Throughput & CostQwen3.5 122B A10B API Benchmarks: Latency, Throughput & Cost<p>About Qwen3.5 122B A10B Qwen3.5 122B A10B is Alibaba Cloud&#8217;s mid-tier multimodal foundation model, released in February 2026. It is a multimodal vision-language Mixture-of-Experts model supporting text, image, and video inputs, designed for native multimodal agent applications. It features 122 billion total parameters with 10 billion activated per token through a hybrid architecture that integrates [&hellip;]</p>
From Precision to Quantization: A Practical Guide to Faster, Cheaper LLMsFrom Precision to Quantization: A Practical Guide to Faster, Cheaper LLMs<p>Large language models live and die by numbers—literally trillions of them. How finely we store those numbers (their precision) determines how much memory a model needs, how fast it runs, and sometimes how good its answers are. This article walks from the basics to the deep end: we’ll start with how computers even store a [&hellip;]</p>
Kimi K2.6 is Now Available on DeepInfraKimi K2.6 is Now Available on DeepInfra<p>Kimi K2.6 can coordinate up to 300 sub-agents executing 4,000 steps in a single autonomous run — Moonshot AI&#8217;s answer to the gap between what frontier models can do in a chat window and what production agentic systems actually need. Built for long-horizon coding, deep research, and complex orchestration, the model is open source under [&hellip;]</p>