curl -X POST \
  -d '{"input": "What is the meaning of life?", "stream": true}' \
  -H 'Content-Type: application/json' \
  https://api.deepinfra.com/v1/inference/meta-llama/Llama-2-70b-chat-hf
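
For readers who prefer Python, the same request can be sketched with only the standard library. The URL, body, and header are copied from the curl command above; the helper names `build_request` and `stream_response` are hypothetical, and the sketch assumes nothing about authentication or the format of the streamed response, since neither is stated on this page.

```python
import json
import urllib.request

# URL taken verbatim from the curl example above.
INFERENCE_URL = (
    "https://api.deepinfra.com/v1/inference/meta-llama/Llama-2-70b-chat-hf"
)


def build_request(prompt: str, stream: bool = True) -> urllib.request.Request:
    """Build the POST request that the curl command sends (hypothetical helper)."""
    body = json.dumps({"input": prompt, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        INFERENCE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_response(req: urllib.request.Request) -> None:
    """Send the request and print the raw response lines as they arrive.

    Assumption: the body is printed as-is because the streaming format
    is not documented here.
    """
    with urllib.request.urlopen(req) as resp:
        for raw_line in resp:
            print(raw_line.decode("utf-8", errors="replace").rstrip())


req = build_request("What is the meaning of life?")
```

Calling `stream_response(req)` would perform the network call; building the request separately makes it easy to inspect before anything is sent.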

Powerful, self-serve machine learning platform where you can turn models into scalable APIs in just a few clicks.
Deploy models to production faster and cheaper with our serverless GPUs than developing the infrastructure yourself.
Low Latency
  • Model is deployed in multiple regions

  • Close to the user

  • Fast network

  • Autoscaling

Cost Effective
  • Share resources

  • Pay per use

  • Simple pricing

Auto Scaling
  • Fast scaling infrastructure

  • Maintain low latency

  • Scale down when not needed

Pricing depends on the model you use. Some of our language models are priced per token. Most other models are billed by inference execution time. Either way, you pay only for what you use. There are no long-term contracts or upfront costs, and you can scale up or down as your business needs change.

/ 1M input tokens
/ 1M output tokens (50% less than ChatGPT-3.5 Turbo)

/minute (55% less than Replicate)
Nvidia A100 GPU
  • billed per millisecond of inference execution time

  • pay only for inference time, not idle time

  • 1 hour free

For dedicated instances and DGX H100 clusters with 3.2 Tbps bandwidth, please contact us at dedicated@deepinfra.com


All models run on H100 or A100 GPUs, optimized for inference performance and low latency.

Auto Scaling

Our system automatically scales the model onto more hardware based on your needs. To eliminate cold starts, you can also reserve GPU memory at $0.04 per GB per hour.
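
As a rough illustration of what a reservation costs, the sketch below multiplies the $0.04 per GB per hour rate quoted above by a reservation size and duration. The function name and the 80 GB, 24 hour example are hypothetical, and the sketch assumes reservations are billed linearly with no minimums or rounding, which this page does not specify.

```python
from decimal import Decimal

# Rate quoted on this page for reserved GPU memory.
RESERVED_USD_PER_GB_HOUR = Decimal("0.04")


def reserved_memory_cost(gigabytes, hours) -> Decimal:
    """Estimate the cost of reserving GPU memory (hypothetical helper).

    Assumes simple linear billing: GB x hours x rate.
    """
    return Decimal(str(gigabytes)) * Decimal(str(hours)) * RESERVED_USD_PER_GB_HOUR


# Hypothetical example: keep 80 GB reserved for one day.
daily_cost = reserved_memory_cost(80, 24)
print(f"${daily_cost:.2f}")
```

Decimal arithmetic is used so that small per-hour amounts add up without floating-point drift.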

The execution time of each inference request is measured with millisecond precision and added to your account. Once a month, we charge you for the time you have used. You can check your current usage in your account at any time.
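
The accumulation described above can be sketched as follows: sum the per-request execution times in milliseconds, then convert the total to minutes and multiply by a per-minute rate. The function name, the request durations, and the `0.03` rate are placeholders chosen for illustration, not actual prices, and the free hour mentioned earlier is not modeled.

```python
from decimal import Decimal

MS_PER_MINUTE = 60_000


def monthly_charge(request_durations_ms, usd_per_minute) -> Decimal:
    """Estimate a monthly charge from per-request execution times.

    Hypothetical helper: only execution time is counted, so idle time
    between requests contributes nothing to the total.
    """
    total_ms = sum(request_durations_ms)
    return Decimal(total_ms) / MS_PER_MINUTE * Decimal(str(usd_per_minute))


# Placeholder values: three requests totalling exactly one minute.
durations_ms = [1_500, 250, 58_250]
charge = monthly_charge(durations_ms, "0.03")
```

Because the total is kept in whole milliseconds until the final conversion, short requests are not rounded up to a larger billing unit in this sketch.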


