DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Note: The list of supported base models is listed on the same page. If you need a base model that is not listed, please contact us at feedback@deepinfra.com
Rate limit will apply on combined traffic of all LoRA adapter models with the same base model. For example, if you have 2 LoRA adapter models with the same base model, and have rate limit of 200. Those 2 LoRA adapter models combined will have rate limit of 200.
Pricing is 50% higher than base model.
LoRA adapter model speed is lower than base model, because there is additional compute and memory overhead to apply the LoRA adapter. From our benchmarks, the LoRA adapter model speed is about 50-60% slower than base model.
You could merge the LoRA adapter with the base model to reduce the overhead. And use custom deployment, the speed will be close to the base model.
Step 3.5 Flash API Benchmarks: Latency, Throughput & Cost<p>About Step 3.5 Flash Step 3.5 Flash is an open-weights reasoning model released in February 2026 by StepFun. It leverages a sparse Mixture of Experts (MoE) architecture with 196 billion total parameters and only 11 billion active parameters per token during inference — delivering state-of-the-art performance at a fraction of the cost of dense models. […]</p>
MiniMax-M2.5 API Benchmarks: Latency, Throughput & Cost<p>About MiniMax-M2.5 MiniMax-M2.5 is a state-of-the-art open-weights large language model released in February 2026. Built on a 230B-parameter Mixture of Experts (MoE) architecture with approximately 10 billion active parameters per forward pass, it features Lightning Attention and supports a context window of up to 205,000 tokens. The model uses extended chain-of-thought reasoning to work through […]</p>
Best API Providers for DeepSeek V4 in 2026<p>DeepSeek V4 is available across a range of hosted API providers, each with different pricing, performance, and deployment trade-offs. The model comes in two variants: V4 Pro, a 1.6 trillion total parameter Mixture-of-Experts model with 49 billion active parameters and a 1M token context window, and V4 Flash, a lighter 284B total parameter variant built […]</p>
© 2026 DeepInfra. All rights reserved.