
NVIDIA Nemotron 3 Super 120B A12B is available across multiple API providers, and the spread in performance and cost is wide enough to change deployment decisions. Artificial Analysis benchmarks three providers — Lightning AI, CoreWeave, and Nebius — with output speed ranging from 154 to 509 t/s (a 3.3x gap), TTFT spanning 0.98s to 1.94s, and blended pricing from $0.26 to $0.39 per 1M tokens. DeepInfra, covered separately below with independently sourced figures, adds a fourth option worth evaluating — lower blended cost and full function calling support. This breakdown covers what each provider delivers and which workloads they fit.
Nemotron 3 Super — Best APIs
| Provider | Best Fit | Blended ($/1M) | Input ($/1M) | Output ($/1M) | TTFT (s) | Speed (t/s) | Context | JSON | Func |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra | Lowest blended cost; full function calling + JSON; private endpoints | 0.20 | 0.10 | 0.50 | 1.01 | 459.3 | 262k | Yes | Yes |
| Lightning AI | Best raw throughput + fastest first answer token for reasoning-heavy UX | 0.39 | 0.35 | 0.75 | 1.15 | 509.3 | 256k | Yes | No |
| CoreWeave | Lowest cost in the AA benchmark set; lowest TTFT; strong for cost-optimized high-volume workloads | 0.26 | 0.20 | 0.80 | 0.98 | 154.4 | 262k | Yes | Yes |
| Nebius | Balanced option; function calling + competitive blended pricing | 0.36 | 0.30 | 0.90 | 1.94 | 171.6 | 256k | Yes | Yes |
Nemotron 3 Super 120B A12B is an open-weight reasoning model from NVIDIA, released March 2026. It uses a hybrid Mamba-Transformer Mixture-of-Experts architecture with 120B total parameters and 12B active per inference pass. Key architectural features: LatentMoE for improved routing accuracy, Multi-Token Prediction (MTP) layers enabling native speculative decoding without a separate draft model, and NVFP4 pre-training with a BF16 release checkpoint. The maximum context length is up to 1M tokens (provider availability varies — see individual listings). The model is designed for reasoning, tool use, agentic workflows, and long-context instruction following across English, code, and multiple languages. Weights are available under the NVIDIA Nemotron Open Model License, which permits commercial use.
For a full breakdown of the model’s architecture, benchmarks, and design decisions, see the Nemotron 3 Super release post. For context on how it sits within the broader Nemotron family, the Nemotron 3 Nano explainer covers the tradeoffs between model sizes.
Lightning AI — Best Raw Throughput and Fastest First Answer Token
Lightning AI leads the benchmark on both raw output speed (509.3 t/s) and time to first answer token (5.08s) — the latter being the more meaningful latency metric for reasoning models, where the model thinks before responding. For user-facing applications where perceived responsiveness depends on getting an actual answer quickly, that 5.08s figure is the number to focus on. The tradeoff is cost ($0.39/1M blended, highest in the set) and the absence of function calling, which limits it for tool-use and agentic pipelines.
CoreWeave — Best Cost and Lowest TTFT
CoreWeave is the cost leader in the Artificial Analysis benchmark set at $0.26/1M blended, with the lowest input price ($0.20/1M) and the fastest TTFT at 0.98s. Its time to first answer token (13.93s) is the highest of the three — meaning the model takes longer to finish reasoning before output begins, which matters for interactive applications. For cost-optimized high-volume workloads where output verbosity is controlled and TTFT matters more than answer latency, CoreWeave is the strongest option in this benchmark set.
Nebius — Balanced Option with Function Calling
Nebius sits between Lightning AI and CoreWeave on most metrics — mid-pack output speed, the highest TTFT of the three (1.94s), the lowest time to first answer token among providers that support function calling (13.59s), and a $0.36/1M blended price. It is the clearest middle-ground option for teams that need function calling but want better throughput economics than CoreWeave. Output pricing ($0.90/1M) is the highest in the benchmark set, which matters for verbose workloads.
DeepInfra — Lowest Blended Cost with Full Feature Support
DeepInfra is not included in this Artificial Analysis benchmark snapshot, but independently published figures from DeepInfra’s own benchmarking place it as a strong fourth option for production Nemotron 3 Super deployments. The figures below come from DeepInfra’s platform data rather than the same AA crawl used for the three providers above.
On DeepInfra’s own figures, it delivers the lowest blended price of the four providers evaluated here ($0.20/1M), with output speed competitive with Lightning AI (459.3 vs. 509.3 t/s), a TTFT of about one second (1.01s), and full function calling support. It is the only provider in this comparison to combine all three. Private endpoint deployment is also available for teams with data isolation requirements. For a detailed cost breakdown across real workload patterns, see the Nemotron 3 Super pricing guide. To get started, visit the Nemotron 3 Super model page on DeepInfra.
1. TTFT vs. Time to First Answer Token
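Nemotron 3 Super is a reasoning model: it thinks before it responds. This creates two distinct latency metrics that developers often conflate. TTFT measures how long the first token of any kind takes to arrive, which for a reasoning model is typically a reasoning token. Time to first answer token measures how long the first token of the actual answer takes to arrive, after the model has finished reasoning.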
Lightning AI’s 5.08s time to first answer token is dramatically better than CoreWeave’s 13.93s and Nebius’s 13.59s — a gap that matters significantly for interactive applications. For batch or async workloads where answer latency is less visible, TTFT and cost become the more relevant metrics.
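As a rough illustration of why the first-answer gap dominates interactive latency, the sketch below estimates the time to a complete answer as time to first answer token plus answer length divided by output speed. The latency and speed figures are the Artificial Analysis numbers quoted in this post; the 500-token answer length, the function name, and the assumption that output speed stays constant after the first answer token are illustrative, not measured.

```python
# Artificial Analysis figures quoted in this post:
# (time to first answer token in seconds, output speed in tokens/second)
LATENCY_FIGURES = {
    "Lightning AI": (5.08, 509.3),
    "CoreWeave": (13.93, 154.4),
    "Nebius": (13.59, 171.6),
}


def estimated_answer_time(first_answer_s: float, speed_tps: float, answer_tokens: int) -> float:
    """Rough seconds until a full answer arrives, assuming constant output speed."""
    return first_answer_s + answer_tokens / speed_tps


ANSWER_TOKENS = 500  # hypothetical answer length, not a benchmark value

estimates = {
    name: estimated_answer_time(first_answer, speed, ANSWER_TOKENS)
    for name, (first_answer, speed) in LATENCY_FIGURES.items()
}

# Print providers from fastest to slowest estimated completion.
for name, seconds in sorted(estimates.items(), key=lambda item: item[1]):
    print(f"{name}: ~{seconds:.1f}s for a {ANSWER_TOKENS}-token answer")
```

Under these assumptions Lightning AI finishes in roughly 6 seconds, while Nebius and CoreWeave take roughly 16 to 17 seconds, so the reasoning delay outweighs the throughput difference for short answers.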
2. Function Calling and Agentic Pipelines
Only CoreWeave, Nebius, and DeepInfra support function calling in this comparison set. Lightning AI does not. For agentic workflows where the model invokes external tools, retrieves data, or returns structured action objects, function calling is a hard requirement — not a preference. Lightning AI’s throughput advantage becomes irrelevant if your application depends on tool use. All four providers support JSON mode, so structured output generation is broadly available regardless of provider.
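One way to apply this is to treat function calling as a hard filter before comparing price. The sketch below is a minimal example using the values from the comparison table in this post; the `Provider` class and `shortlist` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    blended_usd_per_1m: float
    speed_tps: float
    json_mode: bool
    function_calling: bool


# Values copied from the comparison table in this post.
PROVIDER_TABLE = [
    Provider("DeepInfra", 0.20, 459.3, True, True),
    Provider("Lightning AI", 0.39, 509.3, True, False),
    Provider("CoreWeave", 0.26, 154.4, True, True),
    Provider("Nebius", 0.36, 171.6, True, True),
]


def shortlist(providers, need_function_calling: bool):
    """Drop providers that miss a hard requirement, then sort by blended price."""
    eligible = [p for p in providers if p.function_calling or not need_function_calling]
    return sorted(eligible, key=lambda p: p.blended_usd_per_1m)


agentic = shortlist(PROVIDER_TABLE, need_function_calling=True)
print([p.name for p in agentic])
```

With function calling required, Lightning AI drops out regardless of its throughput, and the remaining three are ordered by blended price.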
3. Output Verbosity and Cost
Artificial Analysis has noted that Nemotron 3 Super generates unusually high output-token volumes relative to similar open-weight models. With output pricing ranging from $0.50 to $0.90/1M across providers, unconstrained generation length can materially affect your bill. Setting explicit output caps, using structured outputs to reduce rambling, and measuring average output token counts before committing to provider pricing are all worth doing before scaling. For guidance on how to model these costs, see the token math and cost-per-completion guide.
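To see how output length moves the bill, the sketch below computes a per-request cost from the input and output prices in the comparison table. The token counts (2,000 input tokens, 1,000 versus 8,000 output tokens) are hypothetical and should be replaced with averages measured on your own traffic.

```python
# Input and output prices in USD per 1M tokens, from the comparison table in this post.
PRICES = {
    "DeepInfra": (0.10, 0.50),
    "Lightning AI": (0.35, 0.75),
    "CoreWeave": (0.20, 0.80),
    "Nebius": (0.30, 0.90),
}


def cost_per_request(input_price: float, output_price: float,
                     input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request given per-1M-token prices."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: same prompt size, output either capped or left verbose.
INPUT_TOKENS = 2_000
for name, (inp, out) in PRICES.items():
    capped = cost_per_request(inp, out, INPUT_TOKENS, 1_000)
    verbose = cost_per_request(inp, out, INPUT_TOKENS, 8_000)
    print(f"{name}: capped ${capped:.5f}, verbose ${verbose:.5f} ({verbose / capped:.1f}x)")
```

Because output is priced several times higher than input at every provider listed, the output token count tends to dominate per-request cost once responses get long.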
4. Context Window Availability
The model supports up to 1M tokens at the architectural level, but no provider in this benchmark set exposes that full window — context windows range from 256k to 262k depending on provider. For workloads requiring full 1M context, verify availability directly with the provider before building around it.
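A quick pre-flight check can flag requests that will not fit a given provider. The sketch below treats the 256k and 262k figures from the comparison table as approximate token counts and assumes the window is shared between prompt and output; both are assumptions, and the exact limits should be confirmed with each provider.

```python
# Context windows from the comparison table, treated as approximate token counts.
CONTEXT_WINDOWS = {
    "DeepInfra": 262_000,
    "Lightning AI": 256_000,
    "CoreWeave": 262_000,
    "Nebius": 256_000,
}


def providers_that_fit(prompt_tokens: int, max_output_tokens: int) -> list[str]:
    """Names of providers whose listed window covers prompt plus reserved output."""
    needed = prompt_tokens + max_output_tokens
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


print(providers_that_fit(240_000, 20_000))  # 260,000 tokens needed
print(providers_that_fit(900_000, 20_000))  # beyond every listed window
```

The second call returns an empty list, which matches the point above: a workload built around the full 1M window has no home among the providers listed here.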
The right provider for Nemotron 3 Super depends on what your workload prioritizes. Lightning AI leads on raw throughput and first-answer latency but lacks function calling and carries the highest blended price. CoreWeave is the cost leader in the AA benchmark set with the fastest TTFT, but its first-answer latency (13.93s) is the highest of the group. Nebius is the middle-ground option for teams that need function calling and competitive blended pricing with reasonable throughput. DeepInfra, on its own published figures rather than the AA benchmark, adds the lowest blended cost of the four, a TTFT of about one second, and full function calling in one package.
For a full cost comparison across workload types, see the Nemotron 3 Super pricing guide. To evaluate the smaller member of the family for cost-efficient agentic workloads, the Nemotron 3 Nano 30B benchmarks are a useful reference. To get started with Nemotron 3 Super on DeepInfra, visit the model page.