DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Note: The list of supported base models is listed on the same page. If you need a base model that is not listed, please contact us at feedback@deepinfra.com
Rate limit will apply on combined traffic of all LoRA adapter models with the same base model. For example, if you have 2 LoRA adapter models with the same base model, and have rate limit of 200. Those 2 LoRA adapter models combined will have rate limit of 200.
Pricing is 50% higher than base model.
LoRA adapter model speed is lower than base model, because there is additional compute and memory overhead to apply the LoRA adapter. From our benchmarks, the LoRA adapter model speed is about 50-60% slower than base model.
You could merge the LoRA adapter with the base model to reduce the overhead. And use custom deployment, the speed will be close to the base model.
Qwen3.5 27B API Benchmarks: Latency, Throughput & Cost<p>About Qwen3.5 27B (Reasoning) Qwen3.5 27B is part of Alibaba Cloud’s latest-generation foundation model family, released in February 2026. Unlike the Mixture-of-Experts variants in the Qwen3.5 series, the 27B model uses a dense architecture combining Gated Delta Networks and Feed Forward Networks. It achieves strong benchmark scores including MMLU-Pro (86.1%), GPQA Diamond (85.5%), and SWE-bench […]</p>
LLM API Provider Performance KPIs 101: TTFT, Throughput & End-to-End Goals<p>Fast, predictable responses turn a clever demo into a dependable product. If you’re building on an LLM API provider like DeepInfra, three performance ideas will carry you surprisingly far: time-to-first-token (TTFT), throughput, and an explicit end-to-end (E2E) goal that blends speed, reliability, and cost into something users actually feel. This beginner-friendly guide explains each KPI […]</p>
GLM-5.1 on DeepInfra: Z.AI’s Agentic Engineering Model<p>Z.AI’s GLM-5.1 scores 58.4 on SWE-Bench Pro — ahead of both Claude Opus 4.6 (57.3) and GPT-5.4 (57.7) on real-world software engineering tasks. It’s the direct successor to GLM-5, designed for agentic engineering: long-horizon coding tasks, terminal operations, and repository-level work. The core design premise is that previous models, including GLM-5, tend to plateau after […]</p>
© 2026 DeepInfra. All rights reserved.