
Fast ML Inference, Simple API

Run the top AI models using a simple API, pay per use. Low-cost, scalable, production-ready infrastructure.

$0.65 per 1M input tokens


curl -X POST \
    -d '{"input": "What is the meaning of life?", "stream": true}' \
    -H 'Content-Type: application/json' \
    https://api.deepinfra.com/v1/inference/mistralai/Mixtral-8x22B-Instruct-v0.1
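The same call can be made from Python; a minimal sketch using only the standard library. The payload mirrors the curl example above; authenticated use would add an `Authorization: Bearer <token>` header, omitted here as in the public example.

```python
import json
import urllib.request

API_BASE = "https://api.deepinfra.com/v1/inference"

def build_request(model: str, prompt: str, stream: bool = False):
    """Assemble the URL, headers, and JSON body for an inference call."""
    url = f"{API_BASE}/{model}"
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"input": prompt, "stream": stream}).encode()
    return url, headers, body

def post(url: str, headers: dict, body: bytes) -> str:
    """Perform the HTTP POST (requires network access)."""
    req = urllib.request.Request(url, data=body, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

url, headers, body = build_request(
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
    "What is the meaning of life?",
    stream=True,
)
# print(post(url, headers, body))  # uncomment to actually send the request
```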


Featured models:

What we loved, used and implemented the most last month:
meta-llama/Meta-Llama-3-70B-Instruct
$0.59/$0.79 in/out Mtoken
  • text-generation

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes.

mistralai/Mixtral-8x22B-Instruct-v0.1
$0.65 / Mtoken
  • text-generation

This is the instruction fine-tuned version of Mixtral-8x22B, the latest and largest mixture-of-experts large language model (LLM) from Mistral AI. This state-of-the-art model uses a mixture of 8 expert (MoE) 22B models; during inference, 2 experts are selected. This architecture allows large models to be fast and cheap at inference.

microsoft/WizardLM-2-8x22B
$0.65 / Mtoken
  • text-generation

WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to leading proprietary models.

microsoft/WizardLM-2-7B
$0.07 / Mtoken
  • text-generation

WizardLM-2 7B is the smaller variant of Microsoft AI's latest Wizard model. It is the fastest and achieves performance comparable to leading open-source models 10x its size.

HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
$0.65 / Mtoken
  • text-generation

Zephyr 141B-A35B is an instruction-tuned (assistant) version of Mixtral-8x22B. It was fine-tuned on a mix of publicly available, synthetic datasets. It achieves strong performance on chat benchmarks.

google/gemma-1.1-7b-it
$0.07 / Mtoken
  • text-generation

Gemma is an open-source model designed by Google. This is Gemma 1.1 7B (IT), an update over the original instruction-tuned Gemma release. Gemma 1.1 was trained using a novel RLHF method, leading to substantial gains on quality, coding capabilities, factuality, instruction following and multi-turn conversation quality.

databricks/dbrx-instruct
$0.60 / Mtoken
  • text-generation

DBRX is an open-source LLM created by Databricks. It uses a mixture-of-experts (MoE) architecture with 132B total parameters, of which 36B are active on any input. It outperforms existing open-source LLMs like Llama 2 70B and Mixtral-8x7B on standard industry benchmarks for language understanding, programming, math, and logic.

mistralai/Mixtral-8x7B-Instruct-v0.1
$0.24 / Mtoken
  • text-generation

Mixtral is a mixture-of-experts large language model (LLM) from Mistral AI. This state-of-the-art model uses a mixture of 8 expert (MoE) 7B models; during inference, 2 experts are selected. This architecture allows large models to be fast and cheap at inference. Mixtral-8x7B outperforms Llama 2 70B on most benchmarks.

mistralai/Mistral-7B-Instruct-v0.2
$0.07 / Mtoken
  • text-generation

The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.2 generative text model, trained on a variety of publicly available conversation datasets.

meta-llama/Llama-2-70b-chat-hf
$0.64/$0.80 in/out Mtoken
  • text-generation

Llama 2 is a collection of LLMs trained by Meta. This is the 70B chat-optimized version. This endpoint has per-token pricing.

cognitivecomputations/dolphin-2.6-mixtral-8x7b
$0.24 / Mtoken
  • text-generation

The Dolphin 2.6 Mixtral 8x7b model is a finetuned version of the Mixtral-8x7b model, trained on a variety of data including coding data, for 3 days on 4 A100 GPUs. It is uncensored and requires trust_remote_code. The model is very obedient and good at coding, but not DPO tuned. The dataset has been filtered for alignment and bias. The model is compliant with user requests and can be used for various purposes such as generating code or engaging in general chat.

lizpreciatior/lzlv_70b_fp16_hf
$0.59/$0.79 in/out Mtoken
  • text-generation

A Mythomax/MLewd_13B-style merge of selected 70B models: a multi-model merge of several Llama 2 70B finetunes for roleplaying and creative work. The goal was to create a model that combines creativity with intelligence for an enhanced experience.

openchat/openchat_3.5
$0.10 / Mtoken
  • text-generation

OpenChat is a library of open-source language models that have been fine-tuned with C-RLFT, a strategy inspired by offline reinforcement learning. These models can learn from mixed-quality data without preference labels and have achieved exceptional performance comparable to ChatGPT. The developers of OpenChat are dedicated to creating a high-performance, commercially viable, open-source large language model and are continuously making progress towards this goal.

llava-hf/llava-1.5-7b-hf
$0.34 / Mtoken
  • text-generation

LLaVa is a multimodal model that combines a vision encoder with a language model.

deepinfra/airoboros-70b
$0.70/$0.90 in/out Mtoken
  • text-generation

The latest version of the Airoboros model, a fine-tuned version of Llama-2-70b trained on the Airoboros dataset. This endpoint currently runs jondurbin/airoboros-l2-70b-2.2.1.

stability-ai/sdxl
$0.0005 / sec
  • text-to-image

SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.

meta-llama/Llama-2-7b-chat-hf
$0.07 / Mtoken
  • text-generation

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B fine-tuned model, optimized for dialogue use cases and converted for the Hugging Face Transformers format.

openai/whisper-large
$0.0005 / sec
  • automatic-speech-recognition

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

BAAI/bge-large-en-v1.5
$0.010 / Mtoken
  • embeddings

BGE is a general embedding model. It is pre-trained using RetroMAE and further trained on large-scale pair data using contrastive learning. Note that the goal of pre-training is to reconstruct the text; the pre-trained model cannot be used for similarity calculation directly and needs to be fine-tuned.

View all models

How to deploy Deep Infra in seconds

A powerful, self-serve machine learning platform where you can turn models into scalable APIs in just a few clicks.
Download deepctl

Sign up for a Deep Infra account using GitHub, or log in with GitHub.

Deploy a model

Choose among hundreds of the most popular ML models

Call Your Model in Production

Use a simple REST API to call your model.


Deepinfra Benefits

Deploy models to production faster and cheaper with our serverless GPUs than by building the infrastructure yourself.
Low Latency
  • Model is deployed in multiple regions

  • Close to the user

  • Fast network

  • Autoscaling

Cost Effective
  • Share resources

  • Pay per use

  • Simple pricing

Serverless
  • No ML Ops needed

  • Better cost efficiency

  • Hassle free ML infrastructure

Simple
  • No ML Ops needed

  • Better cost efficiency

  • Hassle free ML infrastructure

Auto Scaling
  • Fast scaling infrastructure

  • Maintain low latency

  • Scale down when not needed

Run costs

Simple Pricing, Deep Infrastructure

We have different pricing models depending on the model used. Some of our language models offer per-token pricing. Most other models are billed for inference execution time. With this pricing model, you only pay for what you use. There are no long-term contracts or upfront costs, and you can easily scale up and down as your business needs change.

Token Pricing

$0.65 / 1M input tokens
Mixtral 8x22b

Model | Context | $ per 1M input tokens | $ per 1M output tokens
mixtral-8x7B-chat | 32k | $0.24 | $0.24
mixtral-8x22B | 64k | $0.65 | $0.65
Zephyr 8x-22b | 64k | $0.65 | $0.65
dbrx | 32k | $0.60 | $0.60
Dolphin-2.6-mixtral-8x7b | 32k | $0.24 | $0.24
OpenChat-3.5 | 8k | $0.10 | $0.10
Llama-3-8B-Instruct | 8k | $0.08 | $0.08
Llama-2-7b-chat | 4k | $0.07 | $0.07
Mistral-7B | 32k | $0.07 | $0.07
Mistral-7B-v2 | 32k | $0.07 | $0.07
WizardLM-2-7B | 32k | $0.07 | $0.07
Gemma-7b | 8k | $0.07 | $0.07
Llama-2-13b-chat | 4k | $0.13 | $0.13
MythoMax-L2-13b | 4k | $0.13 | $0.13
Yi-34B-Chat | 4k | $0.60 | $0.60
CodeLlama-34b-Instruct | 4k | $0.60 | $0.60
Phind-CodeLlama-34B-v2 | 4k | $0.60 | $0.60
Llama-3-70B-Instruct | 8k | $0.59 | $0.79
Llama-2-70b-chat | 4k | $0.64 | $0.80
Airoboros-70b | 4k | $0.70 | $0.90
Lzlv-70b | 4k | $0.59 | $0.79
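As an arithmetic sketch of per-token billing, using the Llama-3-70B-Instruct rates from the table above ($0.59 in / $0.79 out per 1M tokens):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Dollar cost of one request, given per-1M-token prices."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Llama-3-70B-Instruct: $0.59 per 1M input tokens, $0.79 per 1M output tokens.
# A 2,000-token prompt with a 500-token completion costs about $0.001575.
cost = request_cost(2_000, 500, 0.59, 0.79)
print(f"${cost:.6f}")
```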

Custom LLMs

You can deploy your own model on our hardware and pay for uptime. You get dedicated SXM-connected GPUs (for multi-GPU setups), automatic scaling to handle load fluctuations and a very competitive price. Read More

GPU | Price
Nvidia A100 GPU | $2.00/GPU-hour
Nvidia H100 GPU | $4.00/GPU-hour
  • Dedicated A100-80GB & H100-80GB GPUs for your custom LLM needs

  • Billed at minute granularity

  • Invoiced weekly

Dedicated Instances and Clusters

For dedicated instances and DGX H100 clusters with 3.2Tbps bandwidth, please contact us at dedicated@deepinfra.com

Embeddings Pricing


Model | Context | $ per 1M input tokens
bge-large-en-v1.5 | 512 | $0.01
bge-base-en-v1.5 | 512 | $0.005
e5-large-v2 | 512 | $0.01
e5-base-v2 | 512 | $0.005
gte-large | 512 | $0.01
gte-base | 512 | $0.005

Execution Time Pricing

$0.0005/second ($0.03/minute, 55% less than Replicate)

Models that are priced by execution time include SDXL and Whisper.


  • billed per millisecond of inference execution time

  • only pay for inference time, not idle time

  • 1 hour free
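Millisecond billing at the execution-time rate quoted above is simple to sketch; for example, a 2.5-second SDXL run works out to roughly $0.00125:

```python
PRICE_PER_SECOND = 0.0005  # execution-time rate quoted above

def execution_cost(milliseconds: int) -> float:
    """Cost of a run billed at millisecond granularity."""
    return milliseconds / 1000 * PRICE_PER_SECOND

print(execution_cost(2_500))  # 2.5 s of inference ≈ $0.00125
```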

Hardware

All models run on H100 or A100 GPUs, optimized for inference performance and low latency.

Auto Scaling

Our system automatically scales the model onto more hardware based on your needs. We limit each account to 200 concurrent requests. If you need more, drop us a line.
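On the client side, one way to stay under that 200-concurrent-request cap is a semaphore; a sketch with a stand-in coroutine in place of the real API call:

```python
import asyncio

MAX_CONCURRENT = 200  # per-account limit quoted above

async def call_model(sem: asyncio.Semaphore, payload: str) -> str:
    """At most MAX_CONCURRENT of these run at once."""
    async with sem:
        # A real client would perform the HTTP request here;
        # we just simulate some work.
        await asyncio.sleep(0)
        return f"done:{payload}"

async def run_all(payloads):
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(*(call_model(sem, p) for p in payloads))

results = asyncio.run(run_all([f"req{i}" for i in range(5)]))
print(results)
```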

Billing

You get $1.80 in credit when you sign up. After you use it up, you have to add a card or pre-pay. Invoices are generated at the beginning of the month. You can also set a spending limit to avoid surprises.
