Simple Pricing, Deep Infrastructure

We offer different pricing models depending on the model you use. Some of our language models are priced per token; most other models are billed by inference execution time. With either model you pay only for what you use: there are no long-term contracts or upfront costs, and you can easily scale up and down as your business needs change.

Token Pricing


/ 1M input tokens
/ 1M output tokens (50% less than GPT-3.5 Turbo)

Execution Time Pricing


/minute (55% less than Replicate)
Nvidia A100 GPU
  • billed per millisecond of inference execution time

  • only pay for inference time, not idle time

  • 1 hour free
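To illustrate per-millisecond billing, here is a minimal sketch. The $2.00/minute rate below is a hypothetical placeholder, not a published Deep Infra price (the actual per-minute price is shown above).

```python
# Sketch of per-millisecond execution-time billing.
RATE_PER_MINUTE = 2.00  # hypothetical USD per minute of A100 time


def inference_cost(duration_ms: int) -> float:
    """Cost of one request, billed per millisecond of execution time."""
    rate_per_ms = RATE_PER_MINUTE / 60_000  # 60,000 ms in a minute
    return duration_ms * rate_per_ms


# A 250 ms inference at the assumed $2.00/minute rate:
print(round(inference_cost(250), 6))  # → 0.008333
```

Because billing is per millisecond, a burst of short requests costs the same as one long request of equal total duration.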

Dedicated Instances and Clusters

For dedicated instances and DGX H100 clusters with 3.2Tbps bandwidth, please contact us at


All models run on H100 or A100 GPUs, optimized for inference performance and low latency.

Auto Scaling

Our system automatically scales the model to more hardware based on your needs. To eliminate cold starts, you can also reserve GPU memory at $0.04 per GB per hour.

Billing

Each inference request's execution time is measured with millisecond precision and added to your account. Once a month we charge you for the time you've used. You can see your current usage in your account.
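The monthly roll-up described above can be sketched as summing millisecond-precision durations and applying the per-minute rate once. As before, the $2.00/minute rate is a hypothetical placeholder, not a published price.

```python
# Sketch of the monthly billing roll-up: each request's duration is
# recorded in milliseconds, summed, and charged once per month.
RATE_PER_MINUTE = 2.00  # hypothetical USD per minute of GPU time


def monthly_charge(durations_ms: list[int]) -> float:
    """Charge for a month's worth of request durations (in ms)."""
    total_ms = sum(durations_ms)        # millisecond-precision usage
    total_minutes = total_ms / 60_000
    return round(total_minutes * RATE_PER_MINUTE, 2)


# Three requests of 120 ms, 450 ms, and 30 ms (600 ms total):
print(monthly_charge([120, 450, 30]))  # → 0.02
```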

© 2023 Deep Infra. All rights reserved.
