Model | Context | $ per 1M input tokens | $ per 1M output tokens |
---|---|---|---|
Llama-2-7b-chat | 4k | $0.13 | $0.13 |
Mistral-7B | 32k | $0.13 | $0.13 |
OpenChat-3.5 | 8k | $0.13 | $0.13 |
Llama-2-13b-chat | 4k | $0.22 | $0.22 |
MythoMax-L2-13b | 4k | $0.22 | $0.22 |
mixtral-8x7B-chat | 32k | $0.27 | $0.27 |
Yi-34B-Chat | 4k | $0.60 | $0.60 |
CodeLlama-34b-Instruct | 4k | $0.60 | $0.60 |
Phind-CodeLlama-34B-v2 | 4k | $0.60 | $0.60 |
Llama-2-70b-chat | 4k | $0.70 | $0.90 |
Airoboros-70b | 4k | $0.70 | $0.90 |
Lzlv-70b | 4k | $0.70 | $0.90 |
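As a quick illustration of the per-token pricing above, here is a minimal sketch of the cost arithmetic. The helper function and the table excerpt are our own, not a DeepInfra API:

```python
# Per-1M-token prices (USD) excerpted from the table above: (input, output)
PRICES = {
    "Llama-2-7b-chat": (0.13, 0.13),
    "mixtral-8x7B-chat": (0.27, 0.27),
    "Llama-2-70b-chat": (0.70, 0.90),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request under per-token billing."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1_000_000 * in_price + output_tokens / 1_000_000 * out_price

# e.g. a 1200-token prompt with a 300-token completion on Llama-2-70b-chat
print(round(request_cost("Llama-2-70b-chat", 1200, 300), 6))  # 0.00111
```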
$0.0005 per second of inference execution time, billed per millisecond
You pay only for inference time, not idle time
1 hour free
Dedicated A100-80GB GPUs for your custom LLM needs
Billed at one-minute granularity
Invoiced weekly
For dedicated instances and DGX H100 clusters with 3.2Tbps bandwidth, please contact us at dedicated@deepinfra.com
All models run on H100 or A100 GPUs, optimized for inference performance and low latency.
Our system automatically scales the model to more hardware based on your needs. To eliminate cold starts, you can also reserve GPU memory at $0.04 per GB per hour.
Each inference request's execution time is measured with millisecond precision and added to your account. Once per month, we charge you for the time you've used. You can view your current usage in your account.
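The execution-time billing described above is simple arithmetic; here is a sketch, assuming the $0.0005-per-second rate quoted earlier (the function name is our own):

```python
RATE_PER_SECOND = 0.0005  # USD, from the custom-deployment pricing above

def inference_cost(milliseconds: int) -> float:
    """Cost of one request, billed at millisecond precision."""
    return milliseconds / 1000 * RATE_PER_SECOND

# e.g. an inference call that ran for 850 ms
print(inference_cost(850))  # 0.000425
```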