

GLM-4.7-Flash API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About GLM-4.7-Flash: GLM-4.7-Flash is Z.AI’s open-weights reasoning model, released in January 2026. Built on a Mixture-of-Experts (MoE) Transformer architecture, it has 30 billion total parameters, of which only ~3 billion are active per inference, making it exceptionally efficient for its capability class. The model is designed as a lightweight, cost-effective alternative to Z.AI’s flagship GLM-4.7, optimized […]

Kimi K2.5 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Kimi K2.5: Kimi K2.5 is Moonshot AI’s flagship open-source reasoning model, released in January 2026. It is a native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens. The model features a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion activated parameters. Kimi K2.5 […]

MiniMax-M2.5 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About MiniMax-M2.5: MiniMax-M2.5 is a state-of-the-art open-weights large language model released in February 2026. Built on a 230B-parameter Mixture-of-Experts (MoE) architecture with approximately 10 billion active parameters per forward pass, it features Lightning Attention and supports a context window of up to 205,000 tokens. The model uses extended chain-of-thought reasoning to work through […]

GLM-5 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

GLM-5 is the latest open-weights reasoning model from Z.AI (Zhipu AI), released in February 2026 and characterized by high “thinking token” usage. It is a Mixture-of-Experts (MoE) model with 744B total parameters and 40B active parameters, scaling up from GLM-4.5’s 355B parameters. The model was pre-trained on 28.5T tokens and features a 200K+ […]

Introducing Nemotron 3 Super on DeepInfra
Published on 2026.03.11 by Aray Sultanbekova

DeepInfra is an official launch partner for NVIDIA Nemotron 3 Super, the latest open model in the Nemotron family, purpose-built for complex multi-agent applications with a 1M token context window and hybrid MoE architecture.

Building Efficient AI Inference on NVIDIA Blackwell Platform
Published on 2026.02.12 by DeepInfra

DeepInfra delivers up to 20x cost reductions on NVIDIA Blackwell by combining MoE architectures, NVFP4 quantization, and inference optimizations — with a Latitude case study.