Simple Pricing, Deep Infrastructure

We offer different pricing models depending on the model used. Some of our language models are priced per token; most other models are billed by inference execution time. Either way, you only pay for what you use: there are no long-term contracts or upfront costs, and you can easily scale up and down as your business needs change.

Token Pricing

Llama-2-70b-chat
$0.70 / 1M input tokens
$0.90 / 1M output tokens (50% less than GPT-3.5 Turbo)
Model                    Context   $ per 1M input tokens   $ per 1M output tokens
Llama-2-7b-chat          4k        $0.13                   $0.13
Mistral-7B               32k       $0.13                   $0.13
OpenChat-3.5             8k        $0.13                   $0.13
Llama-2-13b-chat         4k        $0.22                   $0.22
MythoMax-L2-13b          4k        $0.22                   $0.22
mixtral-8x7B-chat        32k       $0.27                   $0.27
Yi-34B-Chat              4k        $0.60                   $0.60
CodeLlama-34b-Instruct   4k        $0.60                   $0.60
Phind-CodeLlama-34B-v2   4k        $0.60                   $0.60
Llama-2-70b-chat         4k        $0.70                   $0.90
Airoboros-70b            4k        $0.70                   $0.90
Lzlv-70b                 4k        $0.70                   $0.90
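As a sketch of how per-token billing adds up, using the Llama-2-70b-chat rates from the table above (the request sizes below are hypothetical examples, not real usage):

```python
# Estimated cost of one request under per-token pricing.
# Rates are dollars per 1M tokens; request sizes are made up.

def token_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Return the dollar cost for a single request."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Llama-2-70b-chat: $0.70 / 1M input tokens, $0.90 / 1M output tokens
cost = token_cost(input_tokens=2_000, output_tokens=500,
                  input_rate=0.70, output_rate=0.90)
print(f"${cost:.6f}")  # 0.0014 + 0.00045 = $0.001850
```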

Execution Time Pricing

Nvidia A100 GPU
$0.0005 / second
$0.03 / minute (55% less than Replicate)
  • billed per millisecond of inference execution time

  • only pay for inference time, not idle time

  • 1 hour free
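A rough sketch of how millisecond-granularity execution-time billing works, at the $0.0005/second rate above (the request durations are invented examples):

```python
# Execution-time billing: $0.0005 per second of inference,
# metered per millisecond. Durations below are made-up examples.

RATE_PER_SECOND = 0.0005

def execution_cost(duration_ms):
    """Dollar cost for a single inference, billed per millisecond."""
    return duration_ms / 1000 * RATE_PER_SECOND

requests_ms = [120, 870, 2450]                    # three example calls
total = sum(execution_cost(ms) for ms in requests_ms)
print(f"${total:.6f}")                            # 3.44 s of inference
```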

Custom LLMs

$2.00/GPU-hour
  • Dedicated A100-80GB GPUs for your custom LLM needs

  • Billed at minute granularity

  • Invoiced weekly
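The $2.00/GPU-hour rate with minute granularity can be estimated as follows (the GPU count and runtime below are hypothetical):

```python
import math

# Dedicated GPU billing: $2.00 per GPU-hour, billed at minute
# granularity, so usage is rounded up to the next whole minute.
# The usage figures below are made-up examples.

RATE_PER_HOUR = 2.00

def gpu_cost(gpus, seconds_used):
    """Dollar cost for `gpus` GPUs running for `seconds_used` seconds."""
    billed_minutes = math.ceil(seconds_used / 60)   # minute granularity
    return gpus * billed_minutes * RATE_PER_HOUR / 60

# 2 GPUs for 90.5 minutes -> billed as 91 minutes each
print(f"${gpu_cost(2, 90.5 * 60):.2f}")
```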

Dedicated Instances and Clusters

For dedicated instances and DGX H100 clusters with 3.2Tbps bandwidth, please contact us at dedicated@deepinfra.com

Hardware

All models run on H100 or A100 GPUs, optimized for inference performance and low latency.

Auto Scaling

Our system automatically scales the model onto more hardware based on your needs. To eliminate cold starts, you can also reserve GPU memory at $0.04 per GB per hour.
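The cost of reserving GPU memory at $0.04 per GB per hour can be estimated like so (the reservation size and duration are hypothetical examples):

```python
# Reserved GPU memory to avoid cold starts: $0.04 per GB per hour.
# The reservation below is a made-up example.

RESERVE_RATE = 0.04  # dollars per GB per hour

def reservation_cost(gb, hours):
    """Dollar cost of keeping `gb` gigabytes of GPU memory reserved."""
    return gb * hours * RESERVE_RATE

# Keep a 40 GB model resident for a 24-hour day
print(f"${reservation_cost(40, 24):.2f}")  # 40 * 24 * 0.04 = $38.40
```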

Billing

The execution time of each inference request is measured with millisecond precision and added to your account. Once per month, we charge you for the time you've used. You can view your current usage in your account.