
Fast ML Inference, Simple API

Run the top AI models using a simple API, pay per use. Low cost, scalable and production ready infrastructure.

$0.9 per 1M input tokens


curl -X POST \
  -d '{"input": "What is the meaning of life?", "stream": true}' \
  -H 'Content-Type: application/json' \
  https://api.deepinfra.com/v1/inference/meta-llama/Meta-Llama-3.1-405B-Instruct


Featured models:

What we loved, used and implemented the most last month:
deepseek-ai/DeepSeek-R1-0528
$0.50/$2.15 in/out Mtoken
  • text-generation

The DeepSeek R1 model has undergone a minor version upgrade, with the current version being DeepSeek-R1-0528.

Qwen/Qwen3-235B-A22B
$0.14/$0.60 in/out Mtoken
  • text-generation

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

Qwen/Qwen3-30B-A3B
$0.08/$0.29 in/out Mtoken
  • text-generation

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

Qwen/Qwen3-32B
$0.10/$0.30 in/out Mtoken
  • text-generation

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

Qwen/Qwen3-14B
$0.07/$0.24 in/out Mtoken
  • text-generation

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

deepseek-ai/DeepSeek-Prover-V2-671B
$0.50/$2.18 in/out Mtoken
  • text-generation

DeepSeek-Prover-V2 is an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem-proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process and combined with DeepSeek-V3's step-by-step reasoning to create an initial cold start for reinforcement learning.

meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
$0.16/$0.60 in/out Mtoken
  • text-generation

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Maverick is a 17-billion-parameter model with 128 experts.

meta-llama/Llama-4-Scout-17B-16E-Instruct
$0.08/$0.30 in/out Mtoken
  • text-generation

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Scout is a 17-billion-parameter model with 16 experts.

deepseek-ai/DeepSeek-R1-Turbo
$1.00/$3.00 in/out Mtoken
  • text-generation

We introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.

deepseek-ai/DeepSeek-R1
$0.45/$2.18 in/out Mtoken
  • text-generation

We introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.

microsoft/phi-4-reasoning-plus
$0.07/$0.35 in/out Mtoken
  • text-generation

Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning on a dataset of chain-of-thought traces and reinforcement learning. The supervised fine-tuning dataset includes a blend of synthetic prompts and high-quality filtered data from public domain websites, focused on math, science, and coding skills, as well as alignment data for safety and Responsible AI. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. Phi-4-reasoning-plus has additionally been trained with reinforcement learning; it therefore achieves higher accuracy, but generates on average 50% more tokens, resulting in higher latency.

meta-llama/Llama-Guard-4-12B
$0.05 / Mtoken
  • text-generation

Llama Guard 4 is a natively multimodal safety classifier with 12 billion parameters trained jointly on text and multiple images. Llama Guard 4 is a dense architecture pruned from the Llama 4 Scout pre-trained model and fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It itself acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

Qwen/QwQ-32B
$0.15/$0.20 in/out Mtoken
  • text-generation

QwQ is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ, which is capable of thinking and reasoning, can achieve significantly enhanced performance in downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.

deepseek-ai/DeepSeek-V3-0324
$0.30/$0.88 in/out Mtoken
  • text-generation

DeepSeek-V3-0324 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, 37B of which are activated per token. It is an improved iteration of DeepSeek-V3.

google/gemma-3-27b-it
$0.10/$0.20 in/out Mtoken
  • text-generation

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open-source model and the successor to Gemma 2.

google/gemma-3-12b-it
$0.05/$0.10 in/out Mtoken
  • text-generation

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 12B is an open-source model from Google and a successor to Gemma 2.

google/gemma-3-4b-it
$0.02/$0.04 in/out Mtoken
  • text-generation

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 4B is an open-source model from Google and a successor to Gemma 2.

hexgrad/Kokoro-82M
$0.80 per M characters
  • text-to-speech

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

nari-labs/Dia-1.6B
$20.00 per M characters
  • text-to-speech

Dia directly generates highly realistic dialogue from a transcript. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communications like laughter, coughing, clearing throat, etc.

canopylabs/orpheus-3b-0.1-ft
$7.00 per M characters
  • text-to-speech

Orpheus TTS is a state-of-the-art, Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. This model has been finetuned to deliver human-level speech synthesis, achieving exceptional clarity, expressiveness, and real-time streaming performances.

sesame/csm-1b
$7.00 per M characters
  • text-to-speech

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.

microsoft/Phi-4-multimodal-instruct
$0.05/$0.10 in/out Mtoken
  • text-generation

Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. It processes text, image, and audio inputs, generates text outputs, and comes with a 128K-token context length. The model underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. Each modality supports the following languages:

  • Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian

  • Vision: English

  • Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

deepseek-ai/DeepSeek-R1-Distill-Llama-70B
$0.10/$0.40 in/out Mtoken
  • text-generation

DeepSeek-R1-Distill-Llama-70B is a highly efficient language model that leverages knowledge distillation to achieve state-of-the-art performance. This model distills the reasoning patterns of larger models into a smaller, more agile architecture, resulting in exceptional results on benchmarks like AIME 2024, MATH-500, and LiveCodeBench. With 70 billion parameters, DeepSeek-R1-Distill-Llama-70B offers a unique balance of accuracy and efficiency, making it an ideal choice for a wide range of natural language processing tasks.

deepseek-ai/DeepSeek-V3
$0.38/$0.89 in/out Mtoken
  • text-generation

DeepSeek-V3 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, 37B of which are activated per token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2.

meta-llama/Llama-3.3-70B-Instruct-Turbo
$0.07/$0.25 in/out Mtoken
  • text-generation

Llama 3.3-70B Turbo is a highly optimized version of the Llama 3.3-70B model, utilizing FP8 quantization to deliver significantly faster inference speeds with a minor trade-off in accuracy. The model is designed to be helpful, safe, and flexible, with a focus on responsible deployment and mitigating potential risks such as bias, toxicity, and misinformation. It achieves state-of-the-art performance on various benchmarks, including conversational tasks, language translation, and text generation.

meta-llama/Llama-3.3-70B-Instruct
$0.23/$0.40 in/out Mtoken
  • text-generation

Llama 3.3-70B is a multilingual LLM trained on a massive dataset of 15 trillion tokens, fine-tuned for instruction-following and conversational dialogue. The model is designed to be helpful, safe, and flexible, with a focus on responsible deployment and mitigating potential risks such as bias, toxicity, and misinformation. It achieves state-of-the-art performance on various benchmarks, including conversational tasks, language translation, and text generation.

mistralai/Mistral-Small-24B-Instruct-2501
$0.06/$0.12 in/out Mtoken
  • text-generation

Mistral Small 3 is a 24B-parameter language model optimized for low-latency performance across common AI tasks. Released under the Apache 2.0 license, it features both pre-trained and instruction-tuned versions designed for efficient local deployment. The model achieves 81% accuracy on the MMLU benchmark and performs competitively with larger models like Llama 3.3 70B and Qwen 32B, while operating at three times the speed on equivalent hardware.

microsoft/phi-4
$0.07/$0.14 in/out Mtoken
  • text-generation

Phi-4 is a model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.

openai/whisper-large-v3-turbo
$0.00020 / minute
  • automatic-speech-recognition

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on over 5 million hours of labeled data, Whisper demonstrates a strong ability to generalize to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3. In other words, it is the same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is much faster, at the expense of a minor quality degradation.

View all models

How to deploy Deep Infra in seconds

Powerful, self-serve machine learning platform where you can turn models into scalable APIs in just a few clicks.
Download deepctl

Sign up for a Deep Infra account using GitHub, or log in with GitHub.

Deploy a model

Choose among hundreds of the most popular ML models

Call Your Model in Production

Use a simple REST API to call your model.
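The REST call from the hero example can be assembled with any HTTP client. Below is a minimal Python sketch that builds the URL, headers, and JSON body for the inference endpoint; `YOUR_API_TOKEN` is a placeholder for a key from your Deep Infra dashboard, and the bearer-token `Authorization` header is our assumption about the auth scheme.

```python
import json

API_BASE = "https://api.deepinfra.com/v1/inference"

def build_inference_request(model: str, prompt: str, api_token: str,
                            stream: bool = False):
    """Assemble the URL, headers, and JSON body for one inference call."""
    url = f"{API_BASE}/{model}"
    headers = {
        "Content-Type": "application/json",
        # Assumed auth scheme; replace with your real token.
        "Authorization": f"Bearer {api_token}",
    }
    body = json.dumps({"input": prompt, "stream": stream})
    return url, headers, body

url, headers, body = build_inference_request(
    "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "What is the meaning of life?",
    "YOUR_API_TOKEN",  # placeholder
    stream=True,
)
```

The resulting triple can be handed to any client, e.g. `requests.post(url, headers=headers, data=body)`.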


Deep Infra Benefits

Deploy models to production faster and cheaper with our serverless GPUs than by building the infrastructure yourself.
Low Latency
  • Model is deployed in multiple regions

  • Close to the user

  • Fast network

  • Autoscaling

Cost Effective
  • Share resources

  • Pay per use

  • Simple pricing

Serverless
  • No ML Ops needed

  • Better cost efficiency

  • Hassle free ML infrastructure

Simple
  • No ML Ops needed

  • Better cost efficiency

  • Hassle free ML infrastructure

Auto Scaling
  • Fast scaling infrastructure

  • Maintain low latency

  • Scale down when not needed

Run costs

Simple Pricing, Deep Infrastructure

We have different pricing models depending on the model used. Some of our language models offer per-token pricing. Most other models are billed by inference execution time. With this pricing model, you only pay for what you use. There are no long-term contracts or upfront costs, and you can easily scale up and down as your business needs change.
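Per-token billing reduces to simple arithmetic. A minimal sketch, using the Llama-3.1-8B-Instruct rates ($0.03 in / $0.05 out per 1M tokens) from the token pricing table:

```python
def token_cost(input_tokens: int, output_tokens: int,
               in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost in dollars for a request billed per million tokens."""
    return (input_tokens / 1_000_000) * in_price_per_m \
         + (output_tokens / 1_000_000) * out_price_per_m

# Llama-3.1-8B-Instruct: $0.03 in / $0.05 out per 1M tokens
cost = token_cost(2_000_000, 1_000_000, 0.03, 0.05)
# 2 * $0.03 + 1 * $0.05 = $0.11
```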

Token Pricing

$0.9 / 1M input tokens
Llama-3.1-405B-Instruct

Model                      Context   $ per 1M input tokens   $ per 1M output tokens
mixtral-8x7B-chat          32k       $0.08                   $0.24
wizardLM-2-8x22B           64k       $0.50                   $0.50
Llama-3-8B-Instruct        8k        $0.03                   $0.06
Mistral-7B-v3              32k       $0.028                  $0.054
MythoMax-L2-13b            4k        $0.065                  $0.065
Llama-3-70B-Instruct       8k        $0.30                   $0.40
Llama-3.1-70B-Instruct     128k      $0.23                   $0.40
Llama-3.1-8B-Instruct      128k      $0.03                   $0.05
Llama-3.1-405B-Instruct    32k       $0.80                   $0.80

You can deploy your own model on our hardware and pay for uptime. You get dedicated SXM-connected GPUs (for multi-GPU setups), automatic scaling to handle load fluctuations, and a very competitive price. Read More

GPU               Price
Nvidia A100 GPU   $1.50/GPU-hour
Nvidia H100 GPU   $2.40/GPU-hour
Nvidia H200 GPU   $3.00/GPU-hour
  • Dedicated A100-80GB, H100-80GB & H200-141GB GPUs for your custom LLM needs

  • Billed at minute granularity

  • Invoiced weekly
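Minute-granularity uptime billing is likewise simple to estimate. A sketch using the H100 rate of $2.40/GPU-hour from the table above:

```python
def gpu_cost(minutes: int, price_per_gpu_hour: float, num_gpus: int = 1) -> float:
    """Uptime cost in dollars with minute-level billing granularity."""
    return num_gpus * minutes * (price_per_gpu_hour / 60)

# 95 minutes on a single H100 at $2.40/GPU-hour = 95 * $0.04/min = $3.80
cost = gpu_cost(95, 2.40)
```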

Dedicated Instances and Clusters

For dedicated instances and DGX H100 clusters with 3.2Tbps bandwidth, please contact us at dedicated@deepinfra.com


Model               Context   $ per 1M input tokens
bge-large-en-v1.5   512       $0.01
bge-base-en-v1.5    512       $0.005
e5-large-v2         512       $0.01
e5-base-v2          512       $0.005
gte-large           512       $0.01
gte-base            512       $0.005
Hardware

All models run on H100 or A100 GPUs, optimized for inference performance and low latency.

Auto Scaling

Our system automatically scales the model onto more hardware based on your needs. We limit each account to 200 concurrent requests. If you need more, drop us a line.
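On the client side, you can stay under the 200-concurrent-request cap with a semaphore. A minimal asyncio sketch; `fake_request` is a hypothetical stand-in for a real API call, and the demo limit of 5 is just for illustration:

```python
import asyncio

MAX_CONCURRENCY = 200  # per-account concurrent request cap

async def bounded_call(sem: asyncio.Semaphore, call):
    """Run one request coroutine while respecting the concurrency cap."""
    async with sem:
        return await call()

async def run_all(calls, limit=MAX_CONCURRENCY):
    sem = asyncio.Semaphore(limit)
    return await asyncio.gather(*(bounded_call(sem, c) for c in calls))

# Local demo with dummy "requests" that record peak concurrency.
active = peak = 0

async def fake_request():
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0)  # yield, as a real network call would
    active -= 1
    return "ok"

results = asyncio.run(run_all([fake_request] * 20, limit=5))
```

In real use, each element of `calls` would be a coroutine function issuing one inference request.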

Billing

You must add a card or pre-pay before you can use our services. An invoice is always generated at the beginning of the month, and also throughout the month if you hit your tier's invoicing threshold. You can also set a spending limit to avoid surprises.

Usage Tiers

Every user is part of a usage tier. As your usage and spending go up, we automatically move you to the next usage tier. Every tier has an invoicing threshold; once it is reached, an invoice is automatically generated.
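The tier logic can be sketched as a simple lookup. Qualification amounts and thresholds are taken from the tier table below, with the assumption that Tier 1 requires no prior payment:

```python
# (qualification: total paid so far, invoicing threshold), per tier
TIERS = [
    (0,      20),      # Tier 1
    (100,    100),     # Tier 2
    (500,    500),     # Tier 3
    (2_000,  2_000),   # Tier 4
    (10_000, 10_000),  # Tier 5
]

def invoicing_threshold(total_paid: float) -> float:
    """Return the invoicing threshold of the highest tier qualified for."""
    threshold = TIERS[0][1]
    for qualification, tier_threshold in TIERS:
        if total_paid >= qualification:
            threshold = tier_threshold
    return threshold
```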

Tier     Qualification   Invoicing Threshold
Tier 1   —               $20
Tier 2   $100 paid       $100
Tier 3   $500 paid       $500
Tier 4   $2,000 paid     $2,000
Tier 5   $10,000 paid    $10,000