DeepInfra raises $107M Series B to scale the inference cloud

Today we're announcing $107 million in Series B funding to scale DeepInfra's inference cloud and expand our global capacity. The round is co-led by 500 Global and Georges Harik, with participation from A.Capital Ventures, Crescent Cove, Felicis, NVIDIA, Peak6, Samsung Next, Supermicro, and Upper90.
This is a big moment for our team — but more than that, it's a signal about where AI infrastructure is heading. Since our Series A, we've grown the volume of tokens we process by 25x.
When we started DeepInfra nearly four years ago, we had a conviction that wasn't yet obvious: inference, not training, would become the dominant driver of enterprise AI workloads. We're now squarely at that inflection point.
Two shifts are colliding at once. Open-source models are reaching parity with proprietary systems, unlocking a new wave of innovation at a fraction of the cost. And agent-based systems are driving continuous, high-volume token demand — a single agentic task can require 50 to 100+ model calls and run nonstop.
Inference is no longer a thin layer on top of an AI stack. It's the system constraint that will define the majority of workloads. And most cloud platforms simply weren't built for this always-on, distributed reality. That's why we built DeepInfra from the ground up — for better economics, performance, and security on inference workloads specifically.
Serving inference well isn't just a software problem, and it isn't just a hardware problem. It's a full-stack problem. Sustained, high-throughput, low-latency inference requires specialized hardware, purpose-built networking, and inference-optimized software working in concert. General-purpose cloud infrastructure — designed for a mix of workloads with bursty, unpredictable patterns — leaves performance and cost on the table when applied to always-on token generation.
That's the gap DeepInfra was built to close. We co-design across all three layers so the stack behaves predictably under the kinds of workloads agentic AI actually produces.
Our approach comes from years of building and operating distributed systems at global scale (the team behind DeepInfra also built imo, the messenger app used by 200M+ people worldwide). A few things make our platform distinct:
Purpose-built and vertically integrated. We own and operate our GPU infrastructure across eight U.S. data centers, with more locations rolling out globally. Owning the stack from chips to APIs gives us structurally better efficiency and more predictable latency than providers that resell spot or rented capacity.
Designed for the agentic era. Continuous, high-volume token generation isn't an edge case for us — it's the baseline workload we optimize for.
Collaboration with NVIDIA. We're an early infrastructure collaborator in NVIDIA's open AI ecosystem, supporting Nemotron models, the NemoClaw agent framework, and NVIDIA Dynamo inference software. Early deployment of Blackwell GPUs with Dynamo, with Vera Rubin to follow, is unlocking up to 20x improvements in inference cost efficiency.
Enterprise-ready by default. 150+ open-source models through OpenAI-compatible APIs, zero data retention, SOC 2 and ISO 27001 certified — production-grade from day one.
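OpenAI-compatible, concretely, means existing client code can switch to DeepInfra by changing only the base URL. A minimal sketch of what such a chat-completion request looks like; the endpoint path and model name here are illustrative assumptions, so check the DeepInfra documentation for current values:

```python
import json

# Assumed OpenAI-compatible endpoint; verify against the DeepInfra docs.
BASE_URL = "https://api.deepinfra.com/v1/openai"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON body for a POST to {BASE_URL}/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# Example open-source model identifier (illustrative).
body = build_chat_request(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "Summarize inference vs. training in one sentence.",
)
payload = json.dumps(body)  # ready to send with an API key in the headers
```

With the official `openai` Python client, the same idea is a one-line change: construct the client with `base_url=BASE_URL` and a DeepInfra API key, and the rest of the application code stays untouched.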
This funding will accelerate three things: expanding our global compute capacity, deepening our developer tooling, and supporting the next generation of open-source and agentic models as they ship.
We're grateful to our investors for backing this thesis, and to the developers, scaleups, and enterprises building on DeepInfra. Production-grade inference is becoming the decisive variable in enterprise AI deployment — and we're just getting started.
If you're building agentic or high-throughput AI workloads, come build with us.
— The DeepInfra Team