

Kimi K2.5 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Kimi K2.5
Kimi K2.5 is Moonshot AI’s flagship open-source reasoning model, released in January 2026. It is a native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens. The model features a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion activated parameters. Kimi K2.5 […]

MiniMax-M2.5 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About MiniMax-M2.5
MiniMax-M2.5 is a state-of-the-art open-weights large language model released in February 2026. Built on a 230B-parameter Mixture of Experts (MoE) architecture with approximately 10 billion active parameters per forward pass, it features Lightning Attention and supports a context window of up to 205,000 tokens. The model uses extended chain-of-thought reasoning to work through […]

GLM-5 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

GLM-5 is the latest open-weights reasoning model released by Z AI (Zhipu AI) in February 2026, characterized by high “thinking token” usage. It is a Mixture of Experts (MoE) model with 744B total parameters and 40B active parameters, scaling up from GLM-4.5’s 355B parameters. The model was pre-trained on 28.5T tokens and features a 200K+ […]

Introducing Nemotron 3 Super on DeepInfra
Published on 2026.03.11 by Aray Sultanbekova

DeepInfra is an official launch partner for NVIDIA Nemotron 3 Super, the latest open model in the Nemotron family, purpose-built for complex multi-agent applications with a 1M token context window and hybrid MoE architecture.

Building Efficient AI Inference on NVIDIA Blackwell Platform
Published on 2026.02.12 by DeepInfra

DeepInfra delivers up to 20x cost reductions on the NVIDIA Blackwell platform by combining MoE architectures, NVFP4 quantization, and inference optimizations — illustrated with a Latitude case study.

Function Calling in DeepInfra: Extend Your AI with Real-World Logic
Published on 2026.02.02 by DeepInfra

Modern large language models (LLMs) are remarkably good at understanding and generating text, but until recently they were largely static: they could only respond based on patterns in their training data. Function calling changes that. It lets language models interact with external logic — your own code, APIs, utilities, or business systems — while still […]
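As a minimal sketch of the pattern the excerpt describes: the application advertises a tool as a JSON-schema definition, the model responds with a structured tool call, and the application executes the matching local function and returns the result. The `get_current_weather` function, its schema, and the dispatch helper below are all illustrative assumptions, not part of any DeepInfra API; only the tool-definition and `tool_calls` shapes follow the widely used OpenAI-compatible format.

```python
import json

# Hypothetical local function the model can invoke; the name and
# return values are made up for illustration.
def get_current_weather(city: str) -> dict:
    # A real application would query an actual weather service here.
    return {"city": city, "temperature_c": 21, "condition": "clear"}

# Tool definition in the OpenAI-compatible JSON-schema format,
# passed alongside the chat request so the model knows what it can call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Dispatch table mapping tool names to local callables.
AVAILABLE_TOOLS = {"get_current_weather": get_current_weather}

def run_tool_call(tool_call: dict) -> str:
    """Execute one tool call shaped like a `tool_calls` entry
    from a function-calling model's response."""
    fn = AVAILABLE_TOOLS[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    # The JSON string result would be sent back to the model
    # as a `role: "tool"` message in the next request.
    return json.dumps(fn(**args))

# Simulated model output: a single structured tool call.
result = run_tool_call({
    "function": {
        "name": "get_current_weather",
        "arguments": '{"city": "Berlin"}',
    }
})
print(result)
```

The key design point is that the model never runs code itself: it only emits a structured request, and the application stays in full control of what actually executes.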