

NVIDIA Nemotron 3 Super 120B API Benchmarks: Latency & Cost
Published on 2026.04.03 by DeepInfra

About NVIDIA Nemotron 3 Super 120B A12B

NVIDIA’s Nemotron 3 Super 120B A12B is an open-weight large language model released on March 11, 2026. It features 120B total parameters with only 12B active per forward pass, delivering exceptional compute efficiency for complex multi-agent applications such as software development and cybersecurity triaging. The model uses a […]

GLM-4.7-Flash API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About GLM-4.7-Flash

GLM-4.7-Flash is Z.AI’s open-weights reasoning model released in January 2026. Built on a Mixture-of-Experts (MoE) Transformer architecture, it features 30 billion total parameters with only ~3 billion active per inference, making it exceptionally efficient for its capability class. The model is designed as a lightweight, cost-effective alternative to Z.AI’s flagship GLM-4.7, optimized […]

Kimi K2.5 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Kimi K2.5

Kimi K2.5 is Moonshot AI’s flagship open-source reasoning model, released in January 2026. It is a native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens. The model features a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion activated parameters. Kimi K2.5 […]

MiniMax-M2.5 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About MiniMax-M2.5

MiniMax-M2.5 is a state-of-the-art open-weights large language model released in February 2026. Built on a 230B-parameter Mixture-of-Experts (MoE) architecture with approximately 10 billion active parameters per forward pass, it features Lightning Attention and supports a context window of up to 205,000 tokens. The model uses extended chain-of-thought reasoning to work through […]

GLM-5 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

GLM-5 is the latest open-weights reasoning model released by Z.AI (Zhipu AI) in February 2026, characterized by high “thinking token” usage. It is a Mixture-of-Experts (MoE) model with 744B total parameters and 40B active parameters, scaling up from GLM-4.5’s 355B parameters. The model was pre-trained on 28.5T tokens and features a 200K+ […]

Introducing Nemotron 3 Super on DeepInfra
Published on 2026.03.11 by Aray Sultanbekova

DeepInfra is an official launch partner for NVIDIA Nemotron 3 Super, the latest open model in the Nemotron family. It is purpose-built for complex multi-agent applications, with a 1M-token context window and a hybrid MoE architecture.