DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Note: The list of supported base models is listed on the same page. If you need a base model that is not listed, please contact us at feedback@deepinfra.com
Rate limit will apply on combined traffic of all LoRA adapter models with the same base model. For example, if you have 2 LoRA adapter models with the same base model, and have rate limit of 200. Those 2 LoRA adapter models combined will have rate limit of 200.
Pricing is 50% higher than base model.
LoRA adapter model speed is lower than base model, because there is additional compute and memory overhead to apply the LoRA adapter. From our benchmarks, the LoRA adapter model speed is about 50-60% slower than base model.
You could merge the LoRA adapter with the base model to reduce the overhead. And use custom deployment, the speed will be close to the base model.
Qwen3 Coder 480B A35B API Benchmarks: Latency & Cost<p>About Qwen3 Coder 480B A35B Instruct Qwen3 Coder 480B A35B Instruct is a state-of-the-art large language model developed by the Qwen team at Alibaba Cloud, specifically designed for code generation and agentic coding tasks. It is a Mixture-of-Experts (MoE) model with 480 billion total parameters and 35 billion active parameters per inference, enabling high performance […]</p>
How to Use OpenClaw with DeepInfra: Setup & Workflow Guide<p>When you first learn how to use OpenClaw, the onboarding flow asks for an API key and points you toward Anthropic or OpenAI. Reasonable starting point. For production agents running dozens of tasks a day, it’s an expensive one. OpenClaw works with any OpenAI-compatible API, so you can swap the default model for an open-weight […]</p>
Step 3.5 Flash API Benchmarks: Latency, Throughput & Cost<p>About Step 3.5 Flash Step 3.5 Flash is an open-weights reasoning model released in February 2026 by StepFun. It leverages a sparse Mixture of Experts (MoE) architecture with 196 billion total parameters and only 11 billion active parameters per token during inference — delivering state-of-the-art performance at a fraction of the cost of dense models. […]</p>
© 2026 DeepInfra. All rights reserved.