Simple Pricing, Deep Infrastructure

We have different pricing models depending on the model used. Some of our language models offer per-token pricing. Most other models are billed for inference execution time. With this pricing model, you only pay for what you use. There are no long-term contracts or upfront costs, and you can easily scale up and down as your business needs change.

Token Pricing

deepseek-ai/DeepSeek-V3.2-Exp: $0.27 / 1M input tokens, $0.216 / 1M cached input tokens, $0.40 / 1M output tokens

DeepSeek

DeepSeek's models are a suite of advanced AI systems that prioritize efficiency, scalability, and real-world applicability.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
DeepSeek-OCR	8k	$0.03	$0.10
DeepSeek-V3.2-Exp	160k	$0.27 / $0.216 cached	$0.40
DeepSeek-V3.1-Terminus	160k	$0.27 / $0.216 cached	$1.00
DeepSeek-V3.1	160k	$0.27 / $0.216 cached	$1.00
DeepSeek-V3-0324	160k	$0.25	$0.88
DeepSeek-V3	160k	$0.38	$0.89
DeepSeek-R1	160k	$0.70	$2.40
DeepSeek-R1-0528	160k	$0.50 / $0.40 cached	$2.15
DeepSeek-R1-Turbo	40k	$1.00	$3.00
DeepSeek-R1-0528-Turbo	32k	$1.00	$3.00
DeepSeek-R1-Distill-Llama-70B	128k	$0.60	$1.20
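
As a rough illustration of how per-token rates turn into a request cost, the sketch below applies two rows of the DeepSeek table. The function name `token_cost` and the `cached_tokens` parameter are hypothetical, and the code assumes that cached tokens are a subset of the input tokens billed at the cached rate while the remainder is billed at the regular input rate; the page itself does not spell out that split.

```python
from decimal import Decimal

# USD per 1M tokens, copied from the DeepSeek table above.
RATES = {
    "DeepSeek-V3.2-Exp": {
        "input": Decimal("0.27"),
        "cached": Decimal("0.216"),
        "output": Decimal("0.40"),
    },
    "DeepSeek-R1-0528": {
        "input": Decimal("0.50"),
        "cached": Decimal("0.40"),
        "output": Decimal("2.15"),
    },
}

MILLION = Decimal(1_000_000)


def token_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Estimate the USD cost of one request (hypothetical helper).

    Assumption: cached_tokens is the part of input_tokens billed at the
    cached rate; the rest of the input is billed at the regular rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    rates = RATES[model]
    uncached = input_tokens - cached_tokens
    total = (
        uncached * rates["input"]
        + cached_tokens * rates["cached"]
        + output_tokens * rates["output"]
    )
    return total / MILLION
```

Under these assumptions, 1M input tokens (half of them cached) plus 1M output tokens on DeepSeek-V3.2-Exp would come to $0.643.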

Qwen

The Qwen series offers a comprehensive suite of dense and mixture-of-experts models.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Qwen3-Next-80B-A3B-Instruct	256k	$0.14	$1.10
Qwen3-Coder-480B-A35B-Instruct-Turbo	256k	$0.29	$1.20
Qwen3-Coder-480B-A35B-Instruct	256k	$0.40	$1.60
Qwen3-235B-A22B-Thinking-2507	256k	$0.30	$2.90
Qwen3-235B-A22B-Instruct-2507	256k	$0.09	$0.57
Qwen3-32B	40k	$0.10	$0.28
Qwen3-30B-A3B	40k	$0.08	$0.29
Qwen3-14B	40k	$0.08	$0.24
Qwen2.5-72B-Instruct	32k	$0.12	$0.39

Llama 4

The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Llama-4-Scout-17B-16E	320k	$0.08	$0.30
Llama-4-Maverick-17B-128E	1024k	$0.15	$0.60
Llama-Guard-4-12B	160k	$0.18	$0.18

Llama 3

Meta Llama 3 is a collection of pretrained and instruction-tuned generative text models in 8B, 70B and 405B sizes.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Llama-3.3-70B-Instruct-Turbo	128k	$0.13	$0.38
Llama-3.2-11B-Vision-Instruct	128k	$0.049	$0.049
Llama-3.2-3B-Instruct	128k	$0.02	$0.02
Meta-Llama-3.1-70B-Instruct	128k	$0.40	$0.40
Meta-Llama-3.1-70B-Instruct-Turbo	128k	$0.40	$0.40
Meta-Llama-3.1-8B-Instruct	128k	$0.03	$0.05
Meta-Llama-3.1-8B-Instruct-Turbo	128k	$0.02	$0.03
Meta-Llama-3-8B-Instruct	8k	$0.03	$0.06

Gemini

Developed by Google DeepMind, Gemini is a family of state-of-the-art thinking models with native multimodal capabilities.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
gemini-2.5-pro	976k	$1.25	$10.00
gemini-2.5-flash	976k	$0.30	$2.50
gemini-2.0-flash-001	976k	$0.10	$0.40

Gemma

Gemma is a family of lightweight, state-of-the-art open models from Google.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
gemma-3-27b-it	128k	$0.09	$0.16
gemma-3-12b-it	128k	$0.04	$0.13
gemma-3-4b-it	128k	$0.04	$0.08

Nemotron

NVIDIA Nemotron is a family of open models customized for efficiency, accuracy, and specialized workloads.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
NVIDIA-Nemotron-Nano-12B-v2-VL	128k	$0.20	$0.60
Llama-3.1-Nemotron-70B-Instruct	128k	$1.20	$1.20
Llama-3.3-Nemotron-Super-49B-v1.5	128k	$0.10	$0.40
NVIDIA-Nemotron-Nano-9B-v2	128k	$0.04	$0.16

Claude

Developed by Anthropic, Claude is a family of highly performant, trustworthy AI models built for complex reasoning, advanced coding, and nuanced language understanding.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
claude-4-opus	195k	$16.50	$82.50
claude-4-sonnet	195k	$3.30	$16.50
claude-3-7-sonnet-latest	195k	$3.30 / $0.33 cached	$16.50

Phi

Phi models offer cost-effective, high-performance AI solutions.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
phi-4	16k	$0.07	$0.14

Mistral

Developed by Mistral AI, a leading French research lab, Mistral is a family of open-source AI models built for multilingual excellence, advanced reasoning, and cost-effective performance.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Mistral-Small-3.2-24B-Instruct-2506	125k	$0.075	$0.20
Mistral-Small-24B-Instruct-2501	32k	$0.05	$0.08
Mistral-Nemo-Instruct-2407	128k	$0.02	$0.04
Mixtral-8x7B-Instruct-v0.1	32k	$0.54	$0.54

Voxtral

Voxtral is a family of audio models with state-of-the-art speech-to-text capabilities.

Model	$ per minute of audio input
Voxtral-Small-24B-2507	$0.00300
Voxtral-Mini-3B-2507	$0.00100
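
To make the per-minute audio rates concrete, here is a small sketch. The helper `audio_cost` is hypothetical, and it assumes fractional minutes are billed pro rata; the table only states a price per minute, so actual rounding may differ.

```python
from decimal import Decimal

# USD per minute of audio input, copied from the Voxtral table above.
AUDIO_RATES = {
    "Voxtral-Small-24B-2507": Decimal("0.00300"),
    "Voxtral-Mini-3B-2507": Decimal("0.00100"),
}


def audio_cost(model, seconds):
    """Estimate the USD cost of transcribing `seconds` of audio.

    Assumption: partial minutes are charged proportionally.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return AUDIO_RATES[model] * seconds / 60
```

On that assumption, one hour of audio would cost $0.18 on Voxtral-Small-24B-2507 and $0.06 on Voxtral-Mini-3B-2507.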

Mixture of experts

Mixture-of-experts models split computation across multiple expert subnetworks, providing strong performance.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Mixtral-8x7B-Instruct-v0.1	32k	$0.54	$0.54
WizardLM-2-8x22B	64k	$0.48	$0.48

Less than 10 billion parameters

Our fastest and best-value models, though they may be less precise.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Meta-Llama-3-8B-Instruct	8k	$0.03	$0.06
Meta-Llama-3.1-8B-Instruct	128k	$0.03	$0.05
gemma-3-4b-it	128k	$0.04	$0.08

Between 10 and 70 billion parameters

Models that are fine-tuned for a balance between speed and precision.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
MythoMax-L2-13b	4k	$0.08	$0.08
gemma-3-27b-it	128k	$0.09	$0.16
gemma-3-12b-it	128k	$0.04	$0.13

70 billion parameters and up

Our most capable models, able to handle complex tasks, but also our most expensive and possibly slower to respond.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Meta-Llama-3.1-70B-Instruct	128k	$0.40	$0.40

Flux

Developed by Black Forest Labs, Flux is a family of state-of-the-art image generation and editing models that deliver exceptional visual quality with breakthrough prompt accuracy and photorealism.

Model	$ per image
FLUX.1-Kontext-dev	$0.01 x (w / 1024) x (h / 1024) x (iters / 25)
FLUX-1-Redux-dev	$0.012 x (w / 1024) x (h / 1024) x (iters / 25)
FLUX-1-dev	$0.009 x (w / 1024) x (h / 1024) x (iters / 25)
FLUX-1-schnell	$0.0005 x (w / 1024) x (h / 1024) x iters
FLUX-pro	$0.05
FLUX-1.1-pro	$0.04

Custom LLMs

You can deploy your own model on our hardware and pay for uptime. You get dedicated SXM-connected GPUs (for multi-GPU setups), automatic scaling to handle load fluctuations, and a very competitive price.

Dedicated A100, H100, H200 and B200 GPUs for your custom LLM needs
Billed at per-minute granularity
Invoiced weekly

GPU	Memory	Price
A100	80GB	$0.89 / GPU-hour
H100	80GB	$1.69 / GPU-hour
H200	141GB	$1.99 / GPU-hour
B200	180GB	$2.49 / GPU-hour
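
Since custom deployments are billed per minute of uptime at the GPU-hour rates above, a cost estimate is a single multiplication. The sketch below is illustrative only; `deployment_cost` is a hypothetical helper, and it takes whole minutes as input because the page does not say how partial minutes are rounded.

```python
from decimal import Decimal

# USD per GPU-hour, copied from the GPU table above.
GPU_HOURLY = {
    "A100": Decimal("0.89"),
    "H100": Decimal("1.69"),
    "H200": Decimal("1.99"),
    "B200": Decimal("2.49"),
}


def deployment_cost(gpu, gpu_count, minutes):
    """Estimate the USD cost of running gpu_count GPUs for whole minutes."""
    if gpu_count < 1 or minutes < 0:
        raise ValueError("gpu_count must be >= 1 and minutes >= 0")
    # Multiply first and divide last so exact results stay exact.
    return GPU_HOURLY[gpu] * gpu_count * minutes / 60
```

For instance, eight H100 GPUs kept up for 90 minutes would come to $20.28 under this calculation.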

Dedicated Instances and Clusters

For dedicated instances, DGX H100, and B200 clusters with 3.2 Tbps bandwidth, please contact us at dedicated@deepinfra.com.

Embeddings Pricing

Model	Context	$ per 1M input tokens
bge-base-en-v1.5	512	$0.005
bge-en-icl	8k	$0.01
bge-large-en-v1.5	512	$0.01
bge-m3	8k	$0.01
bge-m3-multi	8k	$0.01
gte-base	512	$0.005
gte-large	512	$0.01
e5-base-v2	512	$0.005
e5-large-v2	512	$0.01
multilingual-e5-large	512	$0.01
all-MiniLM-L12-v2	512	$0.005
all-MiniLM-L6-v2	512	$0.005
all-mpnet-base-v2	512	$0.005
multi-qa-mpnet-base-dot-v1	512	$0.005
paraphrase-MiniLM-L6-v2	512	$0.005
text2vec-base-chinese	512	$0.005

Hardware

All models run on H100 or A100 GPUs, optimized for inference performance and low latency.

Auto Scaling

Our system will automatically scale the model to more hardware based on your needs. We limit each account to 200 concurrent requests. If you need more, drop us a line.
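
A client that fans out many requests can stay under the 200-request limit by capping its own concurrency. The following is a minimal sketch using `asyncio.Semaphore` from the Python standard library; `run_bounded` and the `worker` callable are hypothetical names, and `worker` stands in for whatever coroutine actually sends a request.

```python
import asyncio

MAX_CONCURRENT = 200  # per-account concurrency limit stated above


async def run_bounded(jobs, worker, limit=MAX_CONCURRENT):
    """Run worker(job) for every job with at most `limit` calls in flight.

    Results are returned in the same order as `jobs`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        # Wait for a free slot before starting the call.
        async with semaphore:
            return await worker(job)

    return await asyncio.gather(*(guarded(job) for job in jobs))
```

A caller would pass its own request coroutine, for example `asyncio.run(run_bounded(prompts, send_request))`, where `send_request` is a placeholder for the actual API call. Leaving some headroom below 200 may be sensible if several processes share one account.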

Billing

You must add a card or pre-pay before you can use our services. An invoice is always generated at the beginning of the month, and also during the month whenever you hit your tier's invoicing threshold. You can also set a spending limit to avoid surprises.

Usage Tiers

Every user is part of a usage tier. As your usage and spending go up, we automatically move you to the next usage tier. Every tier has an invoicing threshold. Once it is reached, an invoice is automatically generated.