Machine Learning Models and Infrastructure

Fast ML Inference, Simple API

Run the top AI models using a simple API and pay per use. Low-cost, scalable, production-ready infrastructure.


$2.7 per 1M input tokens

curl -X POST \
  -d '{"input": "What is the meaning of life?", "stream": true}' \
  -H 'Content-Type: application/json' \
  https://api.deepinfra.com/v1/inference/meta-llama/Meta-Llama-3.1-405B-Instruct
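
The same request can be sketched in Python with only the standard library. The helper names `build_inference_request` and `stream_response` are illustrative and not part of any Deep Infra SDK; the URL, header and JSON body are copied from the curl command above. The streaming response format is not described on this page, so the sketch simply prints raw lines as they arrive.

```python
import json
import urllib.request

BASE_URL = "https://api.deepinfra.com/v1/inference"  # taken from the curl example


def build_inference_request(model: str, prompt: str, stream: bool = True) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"input": prompt, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{model}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_response(request: urllib.request.Request) -> None:
    """Send the request and print each raw response line as it arrives."""
    with urllib.request.urlopen(request) as response:
        for raw_line in response:
            print(raw_line.decode("utf-8", errors="replace").rstrip())


request = build_inference_request(
    "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "What is the meaning of life?",
)
# stream_response(request)  # uncomment to perform the actual network call
```

Building the request separately from sending it makes the payload easy to inspect before any network traffic happens.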


Featured models:

What we loved, used and implemented the most last month:

$0.00045 / minute

openai/

whisper-large-v3

automatic-speech-recognition

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

$2.70 / Mtoken

meta-llama/

Meta-Llama-3.1-405B-Instruct

text-generation

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B, 70B and 405B sizes.

$0.52/$0.75 in/out Mtoken

meta-llama/

Meta-Llama-3.1-70B-Instruct

text-generation

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B, 70B and 405B sizes.

$0.06 / Mtoken

meta-llama/

Meta-Llama-3.1-8B-Instruct

text-generation

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B, 70B and 405B sizes.

$0.27 / Mtoken

google/

gemma-2-27b-it

text-generation

Gemma is a family of lightweight, state-of-the-art open models from Google. Gemma-2-27B delivers the best performance for its size class, and even offers competitive alternatives to models more than twice its size.

$0.09 / Mtoken

google/

gemma-2-9b-it

text-generation

Gemma is a family of lightweight, state-of-the-art open models from Google. The 9B Gemma 2 model delivers class-leading performance, outperforming Llama 3 8B and other open models in its size category.

$0.59/$0.79 in/out Mtoken

cognitivecomputations/

dolphin-2.9.1-llama-3-70b

text-generation

Dolphin 2.9.1, a fine-tuned Llama-3-70b model. The new model, trained on filtered data, is more compliant but uncensored. It demonstrates improvements in instruction, conversation, coding, and function calling abilities.

$0.59/$0.79 in/out Mtoken

Sao10K/

L3-70B-Euryale-v2.1

text-generation

Euryale 70B v2.1 is a model from Sao10K focused on creative roleplay.

$0.52/$0.75 in/out Mtoken

meta-llama/

Meta-Llama-3-70B-Instruct

text-generation

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes.

$0.56/$0.77 in/out Mtoken

Qwen/

Qwen2-72B-Instruct

text-generation

The 72 billion parameter Qwen2 excels in language understanding, multilingual capabilities, coding, mathematics, and reasoning.

$0.14 / Mtoken

microsoft/

Phi-3-medium-4k-instruct

text-generation

The Phi-3-Medium-4K-Instruct is a powerful and lightweight language model with 14 billion parameters, trained on high-quality data to excel in instruction following and safety measures. It demonstrates exceptional performance across benchmarks, including common sense, language understanding, and logical reasoning, outperforming models of similar size.

$0.064 / Mtoken

openchat/

openchat-3.6-8b

text-generation

OpenChat 3.6 is a Llama-3-8B fine-tune that outperforms the base model on multiple benchmarks.

$0.06 / Mtoken

mistralai/

Mistral-7B-Instruct-v0.3

text-generation

Mistral-7B-Instruct-v0.3 is an instruction-tuned model, the next iteration of Mistral 7B, with a larger vocabulary, a newer tokenizer and support for function calling.

$0.06 / Mtoken

meta-llama/

Meta-Llama-3-8B-Instruct

text-generation

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes.

$0.65 / Mtoken

mistralai/

Mixtral-8x22B-Instruct-v0.1

text-generation

This is the instruction fine-tuned version of Mixtral-8x22B, the latest and largest mixture-of-experts (MoE) large language model (LLM) from Mistral AI. This state-of-the-art model uses a mixture of 8 experts, each a 22B model. During inference, 2 experts are selected. This architecture allows large models to be fast and cheap at inference.

$0.63 / Mtoken

microsoft/

WizardLM-2-8x22B

text-generation

WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to leading proprietary models.

$0.07 / Mtoken

microsoft/

WizardLM-2-7B

text-generation

WizardLM-2 7B is the smaller variant of Microsoft AI's latest Wizard model. It is the fastest of the family and achieves performance comparable to leading open-source models 10x its size.

$0.24 / Mtoken

mistralai/

Mixtral-8x7B-Instruct-v0.1

text-generation

Mixtral is a mixture-of-experts (MoE) large language model (LLM) from Mistral AI. This state-of-the-art model uses a mixture of 8 experts, each a 7B model. During inference, 2 experts are selected. This architecture allows large models to be fast and cheap at inference. Mixtral-8x7B outperforms Llama 2 70B on most benchmarks.

$0.59/$0.79 in/out Mtoken

lizpreciatior/

lzlv_70b_fp16_hf

text-generation

A Mythomax/MLewd_13B-style merge of selected 70B models: a multi-model merge of several LLaMA2 70B finetunes for roleplaying and creative work. The goal was to create a model that combines creativity with intelligence for an enhanced experience.

$0.34 / Mtoken

llava-hf/

llava-1.5-7b-hf

text-generation

LLaVA is a multimodal model that combines vision and language understanding.

$0.0005 / sec

stability-ai/

sdxl

text-to-image

SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.

$0.010 / Mtoken

BAAI/

bge-large-en-v1.5

embeddings

BGE embedding is a general embedding model. It is pre-trained using RetroMAE and trained on large-scale pair data using contrastive learning. Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned.

View all models

How to deploy Deep Infra in seconds

Powerful, self-serve machine learning platform where you can turn models into scalable APIs in just a few clicks.

Download deepctl

Deploy a model

Choose among hundreds of the most popular ML models

Call Your Model in Production

Use a simple REST API to call your model.

Deep Infra Benefits

Deploy models to production faster and cheaper with our serverless GPUs than developing the infrastructure yourself.

Low Latency

Model is deployed in multiple regions
Close to the user
Fast network
Autoscaling

Cost Effective

Share resources
Pay per use
Simple pricing

Serverless

No ML Ops needed
Better cost efficiency
Hassle free ML infrastructure


Auto Scaling

Fast scaling infrastructure
Maintain low latency
Scale down when not needed

Run costs

Simple Pricing, Deep Infrastructure

We have different pricing models depending on the model used. Some of our language models offer per-token pricing. Most other models are billed for inference execution time. With either pricing model, you only pay for what you use. There are no long-term contracts or upfront costs, and you can easily scale up and down as your business needs change.

Token Pricing

$2.7 / 1M input tokens

Llama-3.1-405B-Instruct

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Llama-3.1-405B-Instruct	32k	$2.70	$2.70
Llama-3.1-70B-Instruct	128k	$0.52	$0.75
Llama-3.1-8B-Instruct	128k	$0.06	$0.06
Llama-3-70B-Instruct	8k	$0.52	$0.75
Qwen2-72b	32k	$0.56	$0.77
Phi-3-medium-4k	4k	$0.14	$0.14
Mistral-7B-v3	32k	$0.06	$0.06
Llama-3-8B-Instruct	8k	$0.06	$0.06
WizardLM-2-8x22B	64k	$0.63	$0.63
WizardLM-2-7B	32k	$0.07	$0.07
mixtral-8x7B-chat	32k	$0.24	$0.24
Lzlv-70b	4k	$0.59	$0.79
OpenChat-3.5	8k	$0.07	$0.07
Mistral-7B	32k	$0.06	$0.06
Mistral-7B-v2	32k	$0.06	$0.06
Qwen2-7b	32k	$0.07	$0.07
MythoMax-L2-13b	4k	$0.10	$0.10
Phind-CodeLlama-34B-v2	4k	$0.60	$0.60
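
As a rough illustration of how per-token pricing adds up, the sketch below computes the cost of a single request from the table above. Only three rows are copied in; the function name `token_cost` and the example token counts are made up for illustration.

```python
from decimal import Decimal

# (input, output) price in USD per 1M tokens, copied from the table above
PRICES_PER_MTOKEN = {
    "Llama-3.1-405B-Instruct": (Decimal("2.70"), Decimal("2.70")),
    "Llama-3.1-70B-Instruct": (Decimal("0.52"), Decimal("0.75")),
    "Llama-3.1-8B-Instruct": (Decimal("0.06"), Decimal("0.06")),
}

TOKENS_PER_MTOKEN = Decimal(1_000_000)


def token_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the USD cost of one request under per-token pricing."""
    input_price, output_price = PRICES_PER_MTOKEN[model]
    total = input_tokens * input_price + output_tokens * output_price
    return total / TOKENS_PER_MTOKEN


# Hypothetical request: 2,000 prompt tokens and 500 generated tokens
cost = token_cost("Llama-3.1-70B-Instruct", input_tokens=2_000, output_tokens=500)
print(f"${cost}")
```

`Decimal` is used instead of floats so that small per-request amounts sum without rounding drift.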

Custom LLMs

You can deploy your own model on our hardware and pay for uptime. You get dedicated SXM-connected GPUs (for multi-GPU setups), automatic scaling to handle load fluctuations, and a very competitive price.

GPU	Price
Nvidia A100 GPU	$2.00/GPU-hour
Nvidia H100 GPU	$4.00/GPU-hour


Dedicated A100-80GB & H100-80GB GPUs for your custom LLM needs
Billed per minute
Invoiced weekly
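
A minimal sketch of the uptime arithmetic, assuming usage is measured in whole minutes and the result is rounded to cents (the rounding rule is an assumption, not something stated here). The function name `dedicated_gpu_cost` is illustrative.

```python
from decimal import Decimal

# USD per GPU-hour, copied from the table above
GPU_HOURLY_USD = {"A100": Decimal("2.00"), "H100": Decimal("4.00")}


def dedicated_gpu_cost(gpu: str, gpu_count: int, minutes: int) -> Decimal:
    """USD cost of keeping gpu_count GPUs up for a whole number of minutes."""
    cost = GPU_HOURLY_USD[gpu] * gpu_count * minutes / 60
    return cost.quantize(Decimal("0.01"))  # assumed: rounded to cents


# Hypothetical deployment: 2 H100 GPUs kept up for 90 minutes
print(dedicated_gpu_cost("H100", gpu_count=2, minutes=90))
```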

Dedicated Instances and Clusters

For dedicated instances and DGX H100 clusters with 3.2Tbps bandwidth, please contact us at dedicated@deepinfra.com

Embeddings Pricing

Model	Context	$ per 1M input tokens
bge-large-en-v1.5	512	$0.01
bge-base-en-v1.5	512	$0.005
e5-large-v2	512	$0.01
e5-base-v2	512	$0.005
gte-large	512	$0.01
gte-base	512	$0.005

Execution Time Pricing

$0.0005/second

$0.03/minute (55% less than Replicate)

Models that are priced by execution time include SDXL and Whisper.

Billed per millisecond of inference execution time
Pay only for inference time, not idle time
1 hour free
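
The per-second rate and per-millisecond billing can be sketched as follows. The function name `execution_cost` is illustrative, and the free hour is left out because how it is applied is not specified here.

```python
from decimal import Decimal

USD_PER_SECOND = Decimal("0.0005")  # rate quoted above


def execution_cost(duration_ms: int) -> Decimal:
    """USD cost of an inference that ran for duration_ms milliseconds."""
    return USD_PER_SECOND * duration_ms / 1000


# One full minute of execution matches the quoted per-minute price
per_minute = execution_cost(60_000)
# A hypothetical 2.5-second image generation
short_run = execution_cost(2_500)
print(per_minute, short_run)
```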

Hardware

All models run on H100 or A100 GPUs, optimized for inference performance and low latency.

Auto Scaling

Our system automatically scales the model to more hardware based on your needs. We limit each account to 200 concurrent requests. If you need more, drop us a line.
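
One way a client could stay under the concurrency limit is to cap in-flight requests with a semaphore. This is a sketch, not an official client: `run_bounded` is an illustrative helper, and `fake_inference` is a hypothetical stand-in for a real API call.

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 200  # per-account limit stated above


async def run_bounded(jobs, limit: int = MAX_CONCURRENT_REQUESTS):
    """Run coroutine factories with at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:  # wait here while `limit` jobs are running
            return await job()

    # gather preserves the order of the input jobs in its result list
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_inference(i: int) -> int:
    """Hypothetical stand-in for a real API call."""
    await asyncio.sleep(0.001)
    return i * 2


results = asyncio.run(
    run_bounded([lambda i=i: fake_inference(i) for i in range(500)])
)
print(len(results))
```

Passing factories rather than already-created coroutines means no request starts until the semaphore admits it.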

Billing

You get $1.80 in credit when you sign up. Once it is used up, you have to add a card or pre-pay. Invoices are generated at the beginning of the month. You can also set a spending limit to avoid surprises.


Latest Models

Phind/

Phind-CodeLlama-34B-v2

openchat/

openchat_3.5

bigcode/

starcoder2-15b

Gryphe/

MythoMax-L2-13b

openai/

whisper-tiny

