Run the top AI models using a simple API, pay per use. Low cost, scalable and production ready infrastructure.
text-generation
Llama 3.3-70B is a multilingual LLM trained on a massive dataset of 15 trillion tokens, fine-tuned for instruction-following and conversational dialogue. The model is designed to be helpful, safe, and flexible, with a focus on responsible deployment and mitigating potential risks such as bias, toxicity, and misinformation. It achieves state-of-the-art performance on various benchmarks, including conversational tasks, language translation, and text generation.
text-generation
Llama 3.3-70B Turbo is a highly optimized version of the Llama 3.3-70B model, utilizing FP8 quantization to deliver significantly faster inference speeds with a minor trade-off in accuracy. The model is designed to be helpful, safe, and flexible, with a focus on responsible deployment and mitigating potential risks such as bias, toxicity, and misinformation. It achieves state-of-the-art performance on various benchmarks, including conversational tasks, language translation, and text generation.
text-generation
Phi-4 is a model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.
text-generation
Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B and 405B sizes
text-generation
Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B and 405B sizes
text-generation
Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B and 405B sizes
text-generation
QwQ is an experimental research model developed by the Qwen Team, designed to advance AI reasoning capabilities. This model embodies the spirit of philosophical inquiry, approaching problems with genuine wonder and doubt. QwQ demonstrates impressive analytical abilities, achieving scores of 65.2% on GPQA, 50.0% on AIME, 90.6% on MATH-500, and 50.0% on LiveCodeBench. With its contemplative approach and exceptional performance on complex problems.
text-generation
Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B and 405B sizes
text-generation
Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B and 405B sizes
text-generation
Qwen2.5-Coder is the latest series of Code-Specific Qwen large language models (formerly known as CodeQwen). It has significant improvements in code generation, code reasoning and code fixing. A more comprehensive foundation for real-world applications such as Code Agents. Not only enhancing coding capabilities but also maintaining its strengths in mathematics and general competencies.
text-generation
Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM generated responses to user queries. This model reaches Arena Hard of 85.0, AlpacaEval 2 LC of 57.6 and GPT-4-Turbo MT-Bench of 8.98, which are known to be predictive of LMSys Chatbot Arena Elo. As of 16th Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.
text-generation
Qwen2.5 is a model pretrained on a large-scale dataset of up to 18 trillion tokens, offering significant improvements in knowledge, coding, mathematics, and instruction following compared to its predecessor Qwen2. The model also features enhanced capabilities in generating long texts, understanding structured data, and generating structured outputs, while supporting multilingual capabilities for over 29 languages.
text-generation
The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in image captioning, visual question answering, and advanced image-text comprehension. Pre-trained on vast multimodal datasets and fine-tuned with human feedback, the Llama 90B Vision is engineered to handle the most demanding image-based AI tasks. This model is perfect for industries requiring cutting-edge multimodal AI capabilities, particularly those dealing with complex, real-time visual and textual analysis.
text-generation
Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis. Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research.
text-to-image
At 8 billion parameters, with superior quality and prompt adherence, this base model is the most powerful in the Stable Diffusion family. This model is ideal for professional use cases at 1 megapixel resolution
text-to-image
Black Forest Labs' latest state-of-the art proprietary model sporting top of the line prompt following, visual quality, details and output diversity.
text-to-image
FLUX.1 [schnell] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. This model offers cutting-edge output quality and competitive prompt following, matching the performance of closed source alternatives. Trained using latent adversarial diffusion distillation, FLUX.1 [schnell] can generate high-quality images in only 1 to 4 steps.
text-to-image
FLUX.1-dev is a state-of-the-art 12 billion parameter rectified flow transformer developed by Black Forest Labs. This model excels in text-to-image generation, providing highly accurate and detailed outputs. It is particularly well-regarded for its ability to follow complex prompts and generate anatomically accurate images, especially with challenging details like hands and faces.
text-to-image
Black Forest Labs' first flagship model based on Flux latent rectified flow transformers
text-to-image
At 2.5 billion parameters, with improved MMDiT-X architecture and training methods, this model is designed to run “out of the box” on consumer hardware, striking a balance between quality and ease of customization. It is capable of generating images ranging between 0.25 and 2 megapixel resolution.
text-to-speech
Kokoro is a frontier TTS model for its size of 82 million parameters (text in/audio out). On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. As of 2 Jan 2025, 10 unique Voicepacks have been released, and a .onnx version of v0.19 is available.
automatic-speech-recognition
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it's the exact same model, except that the number of decoding layers have reduced from 32 to 4. As a result, the model is way faster, at the expense of a minor quality degradation.
automatic-speech-recognition
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
automatic-speech-recognition
Distil-Whisper was proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling. This is the third and final installment of the Distil-Whisper English series. It the knowledge distilled version of OpenAI's Whisper large-v3, the latest and most performant Whisper model to date. Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give superior long-form transcription accuracy with OpenAI's sequential long-form algorithm.
text-generation
WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to those leading proprietary models.
View all models
Sign up for Deep Infra account using github or Login using github
Choose among hundreds of the most popular ML models
Use a simple rest API to call your model.
Model is deployed in multiple regions
Close to the user
Fast network
Autoscaling
Share resources
Pay per use
Simple pricing
No ML Ops needed
Better cost efficiency
Hassle free ML infrastructure
No ML Ops needed
Better cost efficiency
Hassle free ML infrastructure
Fast scaling infrastructure
Maintain low latency
Scale down when not needed
Run costs
Model | Context | $ per 1M input tokens | $ per 1M output tokens |
---|---|---|---|
Llama-3.1-70B-Instruct | 128k | $0.23 | $0.40 |
Llama-3.1-8B-Instruct | 128k | $0.03 | $0.05 |
Llama-3.1-405B-Instruct | 32k | $0.80 | $0.80 |
wizardLM-2-8x22B | 64k | $0.50 | $0.50 |
mixtral-8x7B-chat | 32k | $0.24 | $0.24 |
OpenChat-3.5 | 8k | $0.055 | $0.055 |
Llama-3-8B-Instruct | 8k | $0.03 | $0.06 |
Mistral-7B-v3 | 32k | $0.03 | $0.055 |
MythoMax-L2-13b | 4k | $0.065 | $0.065 |
Llama-3-70B-Instruct | 8k | $0.23 | $0.40 |
Model | Context | $ per 1M input tokens | $ per 1M output tokens |
---|---|---|---|
Loading pricing data... |
You can deploy your own model on our hardware and pay for uptime. You get dedicated SXM-connected GPUs (for multi-GPU setups), automatic scaling to handle load fluctuations and a very competitive price. Read More
GPU | Price |
---|---|
Nvidia A100 GPU | $1.50/GPU-hour |
Nvidia H100 GPU | $2.40/GPU-hour |
Nvidia H200 GPU | $3.00/GPU-hour |
Dedicated A100-80GB, H100-80GB & H200-141GB GPUs for your custom LLM needs
Billed in minute granularity
Invoiced weekly
For dedicated instances and DGX H100 clusters with 3.2Tbps bandwidth, please contact us at dedicated@deepinfra.com
Model | Context | $ per 1M input tokens |
---|---|---|
bge-large-en-v1.5 | 512 | $0.01 |
bge-base-en-v1.5 | 512 | $0.005 |
e5-large-v2 | 512 | $0.01 |
e5-base-v2 | 512 | $0.005 |
gte-large | 512 | $0.01 |
gte-base | 512 | $0.005 |
Models that are priced by execution time include SDXL and Whisper.
billed per millisecond of inference execution time
only pay for the inference time not idle time
1 hour free
All models run on H100 or A100 GPUs, optimized for inference performance and low latency.
Our system will automatically scale the model to more hardware based on your needs. We limit each account to 200 concurrent requests. If you want more drop us a line
You have to add a card or pre-pay or you won't be able to use our services. An invoice is always generated at the beginning of the month, and also throughout the month if you hit your tier invoicing threshold. You can also set a spending limit to avoid surprises.
Every user is part of a usage tier. As your usage and your spending goes up, we automatically move you to the next usage tier. Every tier has an invoicing threshold. Once reached an invoice is automatically generated.
Tier | Qualification & Invoicing Threshold | |
---|---|---|
Tier 1 | $20 | |
Tier 2 | $100 paid | $100 |
Tier 3 | $500 paid | $500 |
Tier 4 | $2,000 paid | $2,000 |
Tier 5 | $10,000 paid | $10,000 |