Simple Pricing, Deep Infrastructure

We offer different pricing models depending on the model used. Some of our language models are priced per token; most other models are billed by inference execution time. With this pricing model, you only pay for what you use. There are no long-term contracts or upfront costs, and you can easily scale up and down as your business needs change.
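As a sketch of how per-token pricing works in practice, the cost of a request is just token counts multiplied by the per-million-token rates (the example rates are taken from the 70B table further down):

```python
# Estimate the cost of a per-token-priced request.
# Rates are in dollars per 1M tokens.
def token_cost(input_tokens: int, output_tokens: int,
               input_rate: float, output_rate: float) -> float:
    """Return the dollar cost for one request."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Example: Llama-3-70B-Instruct at $0.52 input / $0.75 output per 1M tokens.
cost = token_cost(2_000, 500, 0.52, 0.75)
print(f"${cost:.6f}")  # -> $0.001415
```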

Token Pricing

For example, Mixtral 8x22b is priced at $0.65 per 1M input tokens.

Mixture of experts

Mixture-of-experts models split computation across multiple expert subnetworks, delivering strong performance.

| Model | Context | $ per 1M input tokens | $ per 1M output tokens |
|---|---|---|---|
| wizardLM-2-8x22B | 64k | $0.63 | $0.63 |
| mixtral-8x7B-chat | 32k | $0.24 | $0.24 |
| Dolphin-2.6-mixtral-8x7b | 32k | $0.24 | $0.24 |

7 or 8 billion parameters

The 7B & 8B models are our fastest and best-value models, but they may be less precise than larger models.

| Model | Context | $ per 1M input tokens | $ per 1M output tokens |
|---|---|---|---|
| Mistral-7B-v3 | 32k | $0.06 | $0.06 |
| Llama-3-8B-Instruct | 8k | $0.06 | $0.06 |
| WizardLM-2-7B | 32k | $0.07 | $0.07 |
| Gemma-7b | 8k | $0.07 | $0.07 |
| OpenChat-3.5 | 8k | $0.07 | $0.07 |
| Mistral-7B | 32k | $0.06 | $0.06 |
| Mistral-7B-v2 | 32k | $0.06 | $0.06 |
| Qwen2-7b | 32k | $0.07 | $0.07 |

13 billion parameters

The 13B models are fine-tuned for a balance between speed and precision.

| Model | Context | $ per 1M input tokens | $ per 1M output tokens |
|---|---|---|---|
| Phi-3-medium-4k | 4k | $0.14 | $0.14 |
| MythoMax-L2-13b | 4k | $0.10 | $0.10 |

34 billion parameters

The 34B models are more capable still, at a balanced price.

| Model | Context | $ per 1M input tokens | $ per 1M output tokens |
|---|---|---|---|
| Phind-CodeLlama-34B-v2 | 4k | $0.60 | $0.60 |

70 billion parameters

The 70B models are our most capable, handling complex tasks, but they are also our most expensive and may be slower to respond.

| Model | Context | $ per 1M input tokens | $ per 1M output tokens |
|---|---|---|---|
| Qwen2-72b | 32k | $0.56 | $0.77 |
| Lzlv-70b | 4k | $0.59 | $0.79 |
| Llama-3-70B-Instruct | 8k | $0.52 | $0.75 |

You can deploy your own model on our hardware and pay for uptime. You get dedicated SXM-connected GPUs (for multi-GPU setups), automatic scaling to handle load fluctuations, and a very competitive price.

| GPU | Price |
|---|---|
| Nvidia A100 GPU | $2.00/GPU-hour |
| Nvidia H100 GPU | $4.00/GPU-hour |
  • Dedicated A100-80GB & H100-80GB GPUs for your custom LLM needs

  • Billed at one-minute granularity

  • Invoiced weekly
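Given the GPU-hour rates above and one-minute billing granularity, a deployment's cost can be estimated as follows (a sketch; the rounding-up to whole minutes is an assumption consistent with per-minute billing):

```python
import math

# Estimate dedicated-deployment cost: GPU-hour rate, billed per minute.
def deployment_cost(seconds_up: float, num_gpus: int, rate_per_gpu_hour: float) -> float:
    billed_minutes = math.ceil(seconds_up / 60)  # minute granularity
    return billed_minutes * num_gpus * rate_per_gpu_hour / 60

# Example: 2x H100 at $4.00/GPU-hour, up for 90.5 minutes (billed as 91).
print(f"${deployment_cost(90.5 * 60, 2, 4.00):.2f}")  # -> $12.13
```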

Dedicated Instances and Clusters

For dedicated instances and DGX H100 clusters with 3.2Tbps bandwidth, please contact us at dedicated@deepinfra.com


Embeddings

| Model | Context | $ per 1M input tokens |
|---|---|---|
| bge-large-en-v1.5 | 512 | $0.01 |
| bge-base-en-v1.5 | 512 | $0.005 |
| e5-large-v2 | 512 | $0.01 |
| e5-base-v2 | 512 | $0.005 |
| gte-large | 512 | $0.01 |
| gte-base | 512 | $0.005 |
Execution Time Pricing

Models priced by execution time, such as SDXL and Whisper, are billed at $0.03/minute (55% less than Replicate).


  • billed per millisecond of inference execution time

  • only pay for inference time, not idle time

  • 1 hour free
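Since execution-time billing is per millisecond at $0.03/minute, the cost of a single run works out like this (a sketch using the rate stated above):

```python
# Execution-time pricing: $0.03 per minute, billed per millisecond.
RATE_PER_MINUTE = 0.03

def inference_cost(execution_ms: int) -> float:
    """Dollar cost for `execution_ms` milliseconds of inference execution time."""
    return execution_ms / 60_000 * RATE_PER_MINUTE

# Example: a 4.2-second Whisper transcription.
print(f"${inference_cost(4_200):.6f}")  # -> $0.002100
```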

Hardware

All models run on H100 or A100 GPUs, optimized for inference performance and low latency.

Auto Scaling

Our system automatically scales the model to more hardware based on your needs. We limit each account to 200 concurrent requests. If you need a higher limit, drop us a line.
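On the client side, one way to stay under the per-account concurrency cap is to bound the number of in-flight requests, e.g. with a semaphore. A minimal sketch, where `send_request` is a hypothetical stand-in for your actual API call:

```python
import asyncio

MAX_CONCURRENT = 200  # per-account limit

async def send_request(i: int) -> int:
    # Hypothetical stand-in for a real API call.
    await asyncio.sleep(0.001)
    return i

async def run_all(n: int, limit: int = MAX_CONCURRENT) -> list:
    sem = asyncio.Semaphore(limit)

    async def bounded(i: int) -> int:
        async with sem:  # at most `limit` requests in flight
            return await send_request(i)

    # gather preserves submission order in its results
    return await asyncio.gather(*(bounded(i) for i in range(n)))

results = asyncio.run(run_all(500))
```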

Billing

You receive $1.80 in free credit when you sign up. Once it is used up, you must add a card or pre-pay. Invoices are generated at the beginning of the month. You can also set a spending limit to avoid surprises.