We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

Documentation

Introduction

DeepInfra makes it easy to run the latest machine learning models in the cloud.

You focus on your application and your business while we do the heavy lifting for you. Servers. GPUs. Scaling. Monitoring. All wrapped up in a simple, scalable and cost-effective inference API. That's what we do.

You just use our API through REST, Python or JavaScript. We even support most of OpenAI APIs making it easy to migrate and benefit of the huge cost savings.

It is easier and cost-effective to run your inference on DeepInfra instead of using your own hardware or other cloud providers. You only pay for what you use. For example in the case of large language models(LLMs) you only pay for the input + output tokens.

You can checkout all available models or deploy your own. We are constantly adding more. DeepInfra is usualy amongst the first to add a new model once it is available.

Checkout the Getting Started for a quick dive.

You can find your authorization tokens, monitor your deployments, usage and logs in the Dashboard

We offer LangChain integration for supported LLMs.

For announcements and tutorials please check our Blog

Quick Start Guide

Unlock the most affordable AI hosting

Run models at scale with our fully managed GPU infrastructure, delivering enterprise-grade uptime at the industry's best rates.

Contact Sales Get Started

Latest Models

openchat/

openchat_3.5

Phind/

Phind-CodeLlama-34B-v2

Gryphe/

MythoMax-L2-13b

bigcode/

starcoder2-15b

openai/

whisper-tiny

Featured Models

Qwen/

QwQ-32B

meta-llama/

Llama-4-Scout-17B-16E-Instruct

Qwen/

Qwen3-Coder-480B-A35B-Instruct

Qwen/

Qwen3-14B

Qwen/

Qwen3-30B-A3B

microsoft/

phi-4

Company

Pricing

Docs

Compare

DeepStart

About

Careers

Trust Center

Privacy

Terms

Have questions or need a custom solution?

Contact Sales