

NVIDIA Nemotron 3 Nano 30B API Benchmarks: Latency & Cost
Published on 2026.04.03 by DeepInfra

About NVIDIA Nemotron 3 Nano 30B A3B
NVIDIA Nemotron 3 Nano 30B A3B is a large language model trained from scratch by NVIDIA, designed as a unified model for both reasoning and non-reasoning tasks. It is part of the Nemotron 3 family, NVIDIA's most efficient family of open models, built for agentic AI applications. […]

Recent articles
Qwen3 Coder 480B A35B API Benchmarks: Latency & Cost
Published on 2026.04.03 by DeepInfra

About Qwen3 Coder 480B A35B Instruct
Qwen3 Coder 480B A35B Instruct is a state-of-the-art large language model developed by the Qwen team at Alibaba Cloud, specifically designed for code generation and agentic coding tasks. It is a Mixture-of-Experts (MoE) model with 480 billion total parameters and 35 billion active parameters per inference, enabling high performance […]

Kimi K2 0905 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by han

About Kimi K2 0905
Kimi K2 0905 is a state-of-the-art large language model developed by Moonshot AI, representing a significant advancement in open-weight AI capabilities. This Mixture-of-Experts (MoE) model features 1 trillion total parameters with 32 billion activated parameters per forward pass, making it highly efficient while maintaining frontier-level performance. The model supports a 256k […]

Qwen3.5 0.8B API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by han

About Qwen3.5 0.8B (Reasoning)
Qwen3.5 0.8B is part of Alibaba Cloud's Qwen3.5 Small Model Series, released on March 2, 2026. Designed under the philosophy of "More Intelligence, Less Compute," it targets edge devices, mobile phones, and low-latency applications where battery life and memory constraints are critical. It employs an Efficient Hybrid Architecture combining Gated Delta […]

DeepSeek V3.2 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About DeepSeek V3.2
DeepSeek V3.2 is a state-of-the-art large language model that unifies conversational speed and deep reasoning in a single 685B-parameter Mixture-of-Experts (MoE) architecture with 37B parameters activated per token. It is built around three key technical breakthroughs. DeepSeek V3.2 achieved gold-medal performance in the 2025 International Mathematical Olympiad (IMO) and […]

Qwen3.5 2B via DeepInfra: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Qwen3.5 2B (Reasoning)
Qwen3.5 2B is a compact 2-billion-parameter open-weights model released in March 2026 as part of Alibaba Cloud's Qwen3.5 Small Model Series. It employs an Efficient Hybrid Architecture combining Gated Delta Networks (a form of linear attention) with sparse Mixture-of-Experts, delivering high-throughput inference with minimal latency overhead, a significant architectural […]

Qwen3.5 4B via DeepInfra: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Qwen3.5 4B (Reasoning)
Qwen3.5 4B is a compact 4-billion-parameter open-weights model released in March 2026 as part of Alibaba Cloud's Qwen3.5 Small Model Series. It employs an Efficient Hybrid Architecture combining Gated Delta Networks (a form of linear attention) with sparse Mixture-of-Experts, delivering high-throughput inference with minimal latency overhead, a significant architectural […]