Qwen3-Max-Thinking state-of-the-art reasoning model at your fingertips!

Note: The list of supported base models is listed on the same page. If you need a base model that is not listed, please contact us at feedback@deepinfra.com
Rate limit will apply on combined traffic of all LoRA adapter models with the same base model. For example, if you have 2 LoRA adapter models with the same base model, and have rate limit of 200. Those 2 LoRA adapter models combined will have rate limit of 200.
Pricing is 50% higher than base model.
LoRA adapter model speed is lower than base model, because there is additional compute and memory overhead to apply the LoRA adapter. From our benchmarks, the LoRA adapter model speed is about 50-60% slower than base model.
You could merge the LoRA adapter with the base model to reduce the overhead. And use custom deployment, the speed will be close to the base model.
Accelerating Reasoning Workflows with Nemotron 3 Nano on DeepInfraDeepInfra is an official launch partner for NVIDIA Nemotron 3 Nano, the newest open reasoning model in the Nemotron family. Our goal is to give developers, researchers, and teams the fastest and simplest path to using Nemotron 3 Nano from day one.
Build an OCR-Powered PDF Reader & Summarizer with DeepInfra (Kimi K2)<p>This guide walks you from zero to working: you’ll learn what OCR is (and why PDFs can be tricky), how to turn any PDF—including those with screenshots of tables—into text, and how to let an LLM do the heavy lifting to clean OCR noise, reconstruct tables, and summarize the document. We’ll use DeepInfra’s OpenAI-compatible API […]</p>
Building Efficient AI Inference on NVIDIA Blackwell PlatformDeepInfra delivers up to 20x cost reductions on NVIDIA Blackwell by combining MoE architectures, NVFP4 quantization, and inference optimizations — with a Latitude case study.© 2026 Deep Infra. All rights reserved.