Fast ML Inference, Simple API

Run the top AI models using a simple API and pay per use. Low-cost, scalable, and production-ready infrastructure.

$0.7 per 1M input tokens

curl -X POST \
    -d '{"input": "What is the meaning of life?", "stream": true}' \
    -H 'Content-Type: application/json' \
    https://api.deepinfra.com/v1/inference/meta-llama/Llama-2-70b-chat-hf




Featured models:

What we loved, used and implemented the most last month:

View all models

How to deploy Deep Infra in seconds

A powerful, self-serve machine learning platform that turns models into scalable APIs in just a few clicks.
Download deepctl

Sign up for a Deep Infra account using GitHub, or log in with GitHub

Deploy a model

Choose from hundreds of the most popular ML models

Call Your Model in Production

Use a simple REST API to call your model.
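The REST call is the same POST shown in the curl example at the top of the page. As a minimal Python sketch, this builds that exact request without sending it (the endpoint and JSON body are taken from the curl example; authentication headers are omitted here, so check the API docs for the required auth scheme):

```python
import json
import urllib.request

# Endpoint from the curl example on this page.
URL = "https://api.deepinfra.com/v1/inference/meta-llama/Llama-2-70b-chat-hf"

def build_inference_request(prompt: str, stream: bool = True) -> urllib.request.Request:
    """Construct (but do not send) the inference POST request."""
    body = json.dumps({"input": prompt, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_inference_request("What is the meaning of life?")
# urllib.request.urlopen(req) would send it; omitted to keep the sketch offline.
```

With `"stream": true`, the server returns tokens incrementally instead of one final response.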


Deepinfra Benefits

Deploy models to production faster and cheaper with our serverless GPUs than by building the infrastructure yourself.
Low Latency
  • Model is deployed in multiple regions

  • Close to the user

  • Fast network

  • Autoscaling

Cost Effective
  • Share resources

  • Pay per use

  • Simple pricing

  • No ML Ops needed

  • Better cost efficiency

  • Hassle-free ML infrastructure


Auto Scaling
  • Fast scaling infrastructure

  • Maintain low latency

  • Scale down when not needed

Run costs

Simple Pricing, Deep Infrastructure

Pricing depends on the model used. Some of our language models offer per-token pricing; most other models are billed by inference execution time. Either way, you only pay for what you use. There are no long-term contracts or upfront costs, and you can easily scale up and down as your business needs change.

Token Pricing


/ 1M input tokens
/ 1M output tokens (50% less than GPT-3.5 Turbo)
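Per-token pricing is a straight multiply. A minimal sketch, using the $0.7-per-1M-input-token rate quoted at the top of this page; the output-token rate is left as a parameter, since the exact figure varies by model (the $0.9 used in the example is a hypothetical value):

```python
def token_cost(input_tokens: int, output_tokens: int,
               input_rate_per_m: float, output_rate_per_m: float) -> float:
    """USD cost of a request under per-token pricing.

    Rates are dollars per 1M tokens; actual rates vary by model.
    """
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000

# 10k input tokens at $0.7/1M plus 2k output tokens at a hypothetical $0.9/1M:
cost = token_cost(10_000, 2_000, 0.7, 0.9)
```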

Execution Time Pricing


/minute (55% less than Replicate)
Nvidia A100 GPU
  • billed per millisecond of inference execution time

  • only pay for inference time, not idle time

  • 1 hour free
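Execution-time billing works the same way: milliseconds of inference time multiplied by the per-minute GPU rate. A sketch under stated assumptions (the per-minute rate is a parameter here, since the exact A100 figure is not reproduced in this text):

```python
def execution_cost(inference_ms: float, rate_per_minute: float) -> float:
    """USD cost of inference billed per millisecond of execution time.

    Only inference_ms counts; idle time is not billed.
    """
    return (inference_ms / 60_000) * rate_per_minute

# With a hypothetical rate of $0.60/minute, a 250 ms inference
# costs a fraction of a cent.
cost = execution_cost(250, 0.60)
```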

Dedicated Instances and Clusters

For dedicated instances and DGX H100 clusters with 3.2 Tbps bandwidth, please contact us at dedicated@deepinfra.com.


All models run on H100 or A100 GPUs, optimized for inference performance and low latency.

Auto Scaling

Our system automatically scales the model to more hardware based on your needs. To eliminate cold starts, you can also reserve GPU memory at $0.04 per GB per hour.
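At the quoted $0.04 per GB per hour, a warm-start reservation is easy to price out. A quick sketch (the 16 GB reservation size and the 30-day month in the example are illustrative assumptions, not figures from this page):

```python
RESERVE_RATE_PER_GB_HOUR = 0.04  # $/GB/hour, from the pricing note above

def reservation_cost(gb: float, hours: float) -> float:
    """USD cost of keeping GPU memory reserved to avoid cold starts."""
    return gb * RESERVE_RATE_PER_GB_HOUR * hours

# e.g. reserving 16 GB around the clock for a 30-day month:
monthly = reservation_cost(16, 24 * 30)  # 16 GB * $0.04 * 720 h, about $460.80
```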

Billing

Each inference request's execution time is measured with millisecond precision and added to your account. Once per month, we charge you for the time you've used. You can find your current usage in your account.


© 2023 Deep Infra. All rights reserved.
