What Is Google TurboQuant and What Does It Mean for Open Source Inference?
Published on 2026.04.28 by DeepInfra

In late March 2026, Google Research published a paper that got more attention outside of academic circles than most AI research does. TurboQuant, a new compression algorithm for the key-value cache in large language models, landed with enough noise that Cloudflare CEO Matthew Prince called it Google’s DeepSeek moment. The Silicon Valley Pied Piper comparisons […]

Inference Economics: True AI Costs at Scale
Published on 2026.04.28 by DeepInfra

Most teams discover their inference economics the same way: a production bill arrives that looks nothing like the number they expected. The per-token price seemed small enough during testing. Then real traffic showed up, agents started chaining calls, RAG pipelines bloated the context window, and suddenly the math looked completely different. Token prices have fallen […]

Introducing NVIDIA Nemotron 3 Nano Omni on DeepInfra
Published on 2026.04.28 by Aray Sultanbekova

DeepInfra is an official launch partner for NVIDIA Nemotron 3 Nano Omni, the first multimodal model in the Nemotron 3 family — a single open model that understands images, video, audio, documents, and text in one unified inference pass.

NVIDIA Nemotron 3 Nano 30B API Benchmarks: Latency & Cost
Published on 2026.04.03 by DeepInfra

About NVIDIA Nemotron 3 Nano 30B A3B
NVIDIA Nemotron 3 Nano 30B A3B is a large language model trained from scratch by NVIDIA, designed as a unified model for both reasoning and non-reasoning tasks. It is part of the Nemotron 3 family, NVIDIA’s most efficient family of open models, built for agentic AI applications. […]

Qwen3 Coder 480B A35B API Benchmarks: Latency & Cost
Published on 2026.04.03 by DeepInfra

About Qwen3 Coder 480B A35B Instruct
Qwen3 Coder 480B A35B Instruct is a state-of-the-art large language model developed by the Qwen team at Alibaba Cloud, specifically designed for code generation and agentic coding tasks. It is a Mixture-of-Experts (MoE) model with 480 billion total parameters, of which 35 billion are active per token, enabling high performance […]

Kimi K2 0905 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Kimi K2 0905
Kimi K2 0905 is a state-of-the-art large language model developed by Moonshot AI, representing a significant advancement in open-weight AI capabilities. This Mixture-of-Experts (MoE) model features 1 trillion total parameters with 32 billion activated parameters per forward pass, making it highly efficient while maintaining frontier-level performance. The model supports a 256k […]