
Fast ML Inference, Simple API

Run the top AI models using a simple API and pay per use. Low-cost, scalable, production-ready infrastructure.

$1.79 per 1M input tokens


curl -X POST \
  -d '{"input": "What is the meaning of life?", "stream": true}' \
  -H 'Content-Type: application/json' \
  https://api.deepinfra.com/v1/inference/meta-llama/Meta-Llama-3.1-405B-Instruct
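For reference, the same request can be assembled from Python. This is a minimal sketch using only the standard library: the helper name `build_inference_request` is illustrative, and, like the curl command, it sets no authentication header, which the service may well require. The request is only built here, not sent.

```python
import json
import urllib.request

# Base URL taken from the curl example
API_BASE = "https://api.deepinfra.com/v1/inference"


def build_inference_request(model: str, prompt: str, stream: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    body = json.dumps({"input": prompt, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/{model}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_inference_request(
    "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "What is the meaning of life?",
)
print(request.get_method(), request.full_url)
# To actually send it (needs network access): urllib.request.urlopen(request)
```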


Featured models:

What we loved, used and implemented the most last month:
meta-llama/Meta-Llama-3.1-405B-Instruct
$1.79 / Mtoken
  • text-generation

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.

meta-llama/Llama-3.2-11B-Vision-Instruct
$0.055 / Mtoken
  • text-generation

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis. Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research.

meta-llama/Llama-3.2-90B-Vision-Instruct
$0.35/$0.40 in/out Mtoken
  • text-generation

The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in image captioning, visual question answering, and advanced image-text comprehension. Pre-trained on vast multimodal datasets and fine-tuned with human feedback, the Llama 90B Vision is engineered to handle the most demanding image-based AI tasks. This model is perfect for industries requiring cutting-edge multimodal AI capabilities, particularly those dealing with complex, real-time visual and textual analysis.

Qwen/Qwen2.5-72B-Instruct
$0.35/$0.40 in/out Mtoken
  • text-generation

Qwen2.5 is a model pretrained on a large-scale dataset of up to 18 trillion tokens, offering significant improvements in knowledge, coding, mathematics, and instruction following compared to its predecessor Qwen2. The model also features enhanced capabilities in generating long texts, understanding structured data, and generating structured outputs, while supporting multilingual capabilities for over 29 languages.

meta-llama/Meta-Llama-3.1-70B-Instruct
$0.35/$0.40 in/out Mtoken
  • text-generation

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.

meta-llama/Meta-Llama-3.1-8B-Instruct
$0.055 / Mtoken
  • text-generation

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B, 70B, and 405B sizes.

meta-llama/Llama-3.2-3B-Instruct
$0.03/$0.05 in/out Mtoken
  • text-generation

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out).

meta-llama/Llama-3.2-1B-Instruct
$0.01/$0.02 in/out Mtoken
  • text-generation

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out).

mistralai/Mistral-Nemo-Instruct-2407
$0.13 / Mtoken
  • text-generation

A 12B model trained jointly by Mistral AI and NVIDIA. It significantly outperforms existing models of smaller or similar size.

black-forest-labs/FLUX-1-dev
$0.02 x (width / 1024) x (height / 1024) x (iters / 25)
  • text-to-image

FLUX.1-dev is a state-of-the-art 12 billion parameter rectified flow transformer developed by Black Forest Labs. This model excels in text-to-image generation, providing highly accurate and detailed outputs. It is particularly well-regarded for its ability to follow complex prompts and generate anatomically accurate images, especially with challenging details like hands and faces.

black-forest-labs/FLUX-1-schnell
$0.0005 x (width / 1024) x (height / 1024) x iters
  • text-to-image

FLUX.1 [schnell] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. This model offers cutting-edge output quality and competitive prompt following, matching the performance of closed source alternatives. Trained using latent adversarial diffusion distillation, FLUX.1 [schnell] can generate high-quality images in only 1 to 4 steps.

stabilityai/sdxl-turbo
$0.0002 x (width / 1024) x (height / 1024) x (iters / 5)
  • text-to-image

The SDXL Turbo model, developed by Stability AI, is an optimized, fast text-to-image generative model. It is a distilled version of SDXL 1.0, leveraging Adversarial Diffusion Distillation (ADD) to generate high-quality images in fewer steps.
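The three resolution-based price formulas listed for FLUX-1-dev, FLUX-1-schnell, and sdxl-turbo share one shape, so they can be sketched as a single function. This is only an illustration of the arithmetic: the names `IMAGE_PRICING` and `image_price` are made up for this example, and the constants are copied from the listings above.

```python
# (base price in USD, reference iteration count), copied from the listings above
IMAGE_PRICING = {
    "black-forest-labs/FLUX-1-dev": (0.02, 25),
    "black-forest-labs/FLUX-1-schnell": (0.0005, 1),
    "stabilityai/sdxl-turbo": (0.0002, 5),
}


def image_price(model: str, width: int, height: int, iters: int) -> float:
    """Price of one image: base x (width / 1024) x (height / 1024) x (iters / reference)."""
    base_usd, ref_iters = IMAGE_PRICING[model]
    return base_usd * (width / 1024) * (height / 1024) * (iters / ref_iters)


# A 1024x1024 image at each model's reference iteration count costs the base price.
for model, (_, ref_iters) in IMAGE_PRICING.items():
    print(f"{model}: ${image_price(model, 1024, 1024, ref_iters):.4f}")
```

Halving both width and height quarters the price, and the price scales linearly with the iteration count.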

black-forest-labs/FLUX-1.1-pro
$0.04 / img
  • text-to-image

Black Forest Labs' latest state-of-the-art proprietary model, with top-of-the-line prompt following, visual quality, detail, and output diversity.

black-forest-labs/FLUX-pro
$0.05 / img
  • text-to-image

Black Forest Labs' first flagship model, based on Flux latent rectified flow transformers.

openai/whisper-large-v3-turbo
$0.00020 / minute
  • automatic-speech-recognition

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it is the exact same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is way faster, at the expense of a minor quality degradation.

openai/whisper-large-v3
$0.00045 / minute
  • automatic-speech-recognition

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

distil-whisper/distil-large-v3
$0.00018 / minute
  • automatic-speech-recognition

Distil-Whisper was proposed in the paper "Robust Knowledge Distillation via Large-Scale Pseudo Labelling". This is the third and final installment of the Distil-Whisper English series. It is the knowledge-distilled version of OpenAI's Whisper large-v3, the latest and most performant Whisper model to date. Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give superior long-form transcription accuracy with OpenAI's sequential long-form algorithm.

deepinfra/tts
$5.00 per M characters
  • custom

Text-to-Speech (TTS) technology converts written text into spoken words using advanced speech synthesis. TTS systems are used in applications like virtual assistants, accessibility tools for visually impaired users, and language learning software, enabling seamless human-computer interaction.

Qwen/Qwen2.5-Coder-7B
$0.055 / Mtoken
  • text-generation

Qwen2.5-Coder-7B is a powerful code-specific large language model with 7.61 billion parameters. It's designed for code generation, reasoning, and fixing tasks. The model covers 92 programming languages and has been trained on 5.5 trillion tokens of data, including source code, text-code grounding, and synthetic data.

google/gemma-2-27b-it
$0.27 / Mtoken
  • text-generation

Gemma is a family of lightweight, state-of-the-art open models from Google. Gemma-2-27B delivers the best performance for its size class, and even offers competitive alternatives to models more than twice its size.

google/gemma-2-9b-it
$0.06 / Mtoken
  • text-generation

Gemma is a family of lightweight, state-of-the-art open models from Google. The 9B Gemma 2 model delivers class-leading performance, outperforming Llama 3 8B and other open models in its size category.

Sao10K/L3-70B-Euryale-v2.1
$0.35/$0.40 in/out Mtoken
  • text-generation

Euryale 70B v2.1 is a model from Sao10K focused on creative roleplay.

meta-llama/Meta-Llama-3-70B-Instruct
$0.35/$0.40 in/out Mtoken
  • text-generation

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes.

mistralai/Mistral-7B-Instruct-v0.3
$0.055 / Mtoken
  • text-generation

Mistral-7B-Instruct-v0.3 is an instruction-tuned model, the next iteration of Mistral 7B, with a larger vocabulary, a newer tokenizer, and support for function calling.

meta-llama/Meta-Llama-3-8B-Instruct
$0.055 / Mtoken
  • text-generation

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction-tuned generative text models in 8B and 70B sizes.

microsoft/WizardLM-2-8x22B
$0.50 / Mtoken
  • text-generation

WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to leading proprietary models.

microsoft/WizardLM-2-7B
$0.055 / Mtoken
  • text-generation

WizardLM-2 7B is the smaller variant of Microsoft AI's latest Wizard model. It is the fastest and achieves performance comparable to leading open-source models 10x larger.

mistralai/Mixtral-8x7B-Instruct-v0.1
$0.24 / Mtoken
  • text-generation

Mixtral is a mixture-of-experts (MoE) large language model (LLM) from Mistral AI. It is a state-of-the-art machine learning model that combines 8 expert 7B models. During inference, 2 experts are selected. This architecture allows large models to be fast and cheap at inference. Mixtral-8x7B outperforms Llama 2 70B on most benchmarks.

lizpreciatior/lzlv_70b_fp16_hf
$0.35/$0.40 in/out Mtoken
  • text-generation

A Mythomax/MLewd_13B-style merge of selected 70B models: a multi-model merge of several LLaMA2 70B finetunes for roleplaying and creative work. The goal was to create a model that combines creativity with intelligence for an enhanced experience.

BAAI/bge-large-en-v1.5
$0.010 / Mtoken
  • embeddings

BGE embedding is a general embedding model. It is pre-trained using RetroMAE and trained on large-scale pair data using contrastive learning. Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned.

View all models

How to deploy Deep Infra in seconds

Powerful, self-serve machine learning platform where you can turn models into scalable APIs in just a few clicks.
Download deepctl

Sign up for a Deep Infra account using GitHub, or log in using GitHub.

Deploy a model

Choose from hundreds of the most popular ML models.

Call Your Model in Production

Use a simple REST API to call your model.


Deep Infra Benefits

Deploy models to production faster and cheaper with our serverless GPUs than developing the infrastructure yourself.
Low Latency
  • Model is deployed in multiple regions

  • Close to the user

  • Fast network

  • Autoscaling

Cost Effective
  • Share resources

  • Pay per use

  • Simple pricing

Serverless
  • No ML Ops needed

  • Better cost efficiency

  • Hassle free ML infrastructure

Simple
  • No ML Ops needed

  • Better cost efficiency

  • Hassle free ML infrastructure

Auto Scaling
  • Fast scaling infrastructure

  • Maintain low latency

  • Scale down when not needed

Run costs

Simple Pricing, Deep Infrastructure

We have different pricing models depending on the model used. Some of our language models offer per-token pricing. Most other models are billed for inference execution time. With this pricing model, you only pay for what you use. There are no long-term contracts or upfront costs, and you can easily scale up and down as your business needs change.

Token Pricing

$1.79 / 1M input tokens
Llama-3.1-405B-Instruct

Model                      Context   $ per 1M input tokens   $ per 1M output tokens
Llama-3.1-70B-Instruct     128k      $0.35                   $0.40
Llama-3.1-8B-Instruct      128k      $0.055                  $0.055
Llama-3-70B-Instruct       8k        $0.35                   $0.40
Mistral-7B-v3              32k       $0.055                  $0.055
Llama-3-8B-Instruct        8k        $0.055                  $0.055
WizardLM-2-8x22B           64k       $0.50                   $0.50
WizardLM-2-7B              32k       $0.055                  $0.055
mixtral-8x7B-chat          32k       $0.24                   $0.24
Lzlv-70b                   4k        $0.35                   $0.40
OpenChat-3.5               8k        $0.055                  $0.055
MythoMax-L2-13b            4k        $0.10                   $0.10
Llama-3.1-405B-Instruct    32k       $1.79                   $1.79
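
As a rough illustration of per-token pricing, the cost of a workload is the input and output token counts multiplied by the corresponding per-million rates. The sketch below copies a few rows from the table; the names `TOKEN_PRICES` and `token_cost` are illustrative, not part of any Deep Infra API.

```python
# USD per 1M tokens as (input, output), copied from a few rows of the table above
TOKEN_PRICES = {
    "Llama-3.1-70B-Instruct": (0.35, 0.40),
    "Llama-3.1-8B-Instruct": (0.055, 0.055),
    "Llama-3.1-405B-Instruct": (1.79, 1.79),
}


def token_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for the given number of input and output tokens."""
    input_price, output_price = TOKEN_PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Example: 2M input tokens and 500k output tokens on Llama-3.1-70B-Instruct
cost = token_cost("Llama-3.1-70B-Instruct", 2_000_000, 500_000)
print(f"${cost:.2f}")
```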

You can deploy your own model on our hardware and pay for uptime. You get dedicated SXM-connected GPUs (for multi-GPU setups), automatic scaling to handle load fluctuations, and a very competitive price.

GPU                Price
Nvidia A100 GPU    $2.00/GPU-hour
Nvidia H100 GPU    $4.00/GPU-hour
  • Dedicated A100-80GB & H100-80GB GPUs for your custom LLM needs

  • Billed in minute granularity

  • Invoiced weekly
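
To estimate what a dedicated deployment might cost, multiply the GPU-hour price by the number of GPUs and the billed uptime. The sketch below assumes that a partial minute is rounded up to a whole minute, which the page does not state; `dedicated_cost` is an illustrative name.

```python
import math

# USD per GPU-hour, from the price table above
GPU_HOURLY_USD = {"A100": 2.00, "H100": 4.00}


def dedicated_cost(gpu: str, num_gpus: int, uptime_seconds: float) -> float:
    """Estimated cost in USD for a dedicated deployment billed per minute."""
    # Assumption: partial minutes are rounded up to the next whole minute.
    billed_minutes = math.ceil(uptime_seconds / 60)
    return GPU_HOURLY_USD[gpu] * num_gpus * billed_minutes / 60


# Example: two H100 GPUs running for 90 minutes
print(f"${dedicated_cost('H100', 2, 90 * 60):.2f}")
```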

Dedicated Instances and Clusters

For dedicated instances and DGX H100 clusters with 3.2Tbps bandwidth, please contact us at dedicated@deepinfra.com


Model                Context   $ per 1M input tokens
bge-large-en-v1.5    512       $0.01
bge-base-en-v1.5     512       $0.005
e5-large-v2          512       $0.01
e5-base-v2           512       $0.005
gte-large            512       $0.01
gte-base             512       $0.005
$0.03/minute (55% less than Replicate)

Models that are priced by execution time include SDXL and Whisper.


  • billed per millisecond of inference execution time

  • pay only for inference time, not idle time

  • 1 hour free

Hardware

All models run on H100 or A100 GPUs, optimized for inference performance and low latency.

Auto Scaling

Our system will automatically scale the model to more hardware based on your needs. We limit each account to 200 concurrent requests. If you need more, drop us a line.
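
One way to stay under the 200-request limit on the client side is to cap the number of in-flight calls with a semaphore. The sketch below uses `asyncio` with a stand-in coroutine (`fake_call`, hypothetical) in place of a real inference request; `run_limited` is likewise an illustrative helper, not part of any Deep Infra SDK.

```python
import asyncio

MAX_CONCURRENT = 200  # per-account limit stated above


async def run_limited(jobs, limit=MAX_CONCURRENT):
    """Run zero-argument coroutine functions with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather() returns results in the same order as the jobs were given
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_call(i):
    """Stand-in for a real inference request (hypothetical)."""
    await asyncio.sleep(0.001)
    return i * 2


results = asyncio.run(run_limited([lambda i=i: fake_call(i) for i in range(500)]))
print(len(results))
```

Requests beyond the cap simply wait for a free slot, so a burst of work is smoothed out instead of exceeding the limit.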

Billing

You get $1.80 in credit when you sign up. After you use it up, you must add a card or pre-pay to keep using our services. An invoice is always generated at the beginning of the month, and also during the month if you hit your tier's invoicing threshold. You can also set a spending limit to avoid surprises.

Usage Tiers

Every user is part of a usage tier. As your usage and spending go up, we automatically move you to the next usage tier. Every tier has an invoicing threshold; once it is reached, an invoice is generated automatically.

Tier      Qualification                             Invoicing Threshold
Tier 1    -                                         $20
Tier 2    $100 paid and 7+ days since payment       $50
Tier 3    $500 paid and 7+ days since payment       $250
Tier 4    $1,500 paid and 14+ days since payment    $1,000
Tier 5    $10,000 paid and 30+ days since payment   $5,000
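
The tier table can be read as a simple lookup: pick the highest tier whose paid amount and waiting period are both satisfied. The sketch below encodes that reading; it assumes the tiers are cumulative and that "days since payment" is a single number of days, neither of which the page spells out. `usage_tier` is an illustrative name.

```python
# (tier, minimum USD paid, minimum days since payment, invoicing threshold in USD)
TIERS = [
    (1, 0, 0, 20),
    (2, 100, 7, 50),
    (3, 500, 7, 250),
    (4, 1_500, 14, 1_000),
    (5, 10_000, 30, 5_000),
]


def usage_tier(total_paid_usd: float, days_since_payment: int) -> tuple[int, int]:
    """Return (tier, invoicing threshold) for the highest tier whose conditions are met."""
    tier, threshold = 1, 20
    for number, min_paid, min_days, tier_threshold in TIERS:
        if total_paid_usd >= min_paid and days_since_payment >= min_days:
            tier, threshold = number, tier_threshold
    return tier, threshold


print(usage_tier(600, 10))
```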