

Inference Economics: True AI Costs at Scale
Published on 2026.04.28 by DeepInfra

Most teams discover their inference economics the same way: a production bill arrives that looks nothing like the number they expected. The per-token price seemed small enough during testing. Then real traffic showed up, agents started chaining calls, RAG pipelines bloated the context window, and suddenly the math looked completely different.

Token prices have fallen about 10x every year since 2021. Equivalent GPT-4 class performance that cost $20 per million tokens in late 2022 now runs closer to $0.40. That is a genuinely remarkable trend. But total AI spend for companies in production has gone up, not down, over the same period. More capable models invite more ambitious use cases, which means more tokens, more calls, and more infrastructure complexity. Understanding inference economics is not just about finding the cheapest model. It is about understanding where your tokens actually go and making deliberate choices at each layer.

The Real Cost Drivers Behind Your Inference Bill

Your actual inference cost is a function of four things:

  • How many tokens you send per request
  • How many tokens the model generates per response
  • How many requests you run per day
  • What you pay per token at your chosen provider

The first three are almost entirely in your control and they compound fast.

A request that sends 2,000 input tokens and receives 500 output tokens costs roughly a quarter as much as one sending 8,000 input tokens and receiving 2,000 output tokens, all else equal. At 50,000 requests per day, that difference can be the margin between a viable product and an unviable one.
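
As a quick sanity check on how those four variables combine, here is a minimal sketch of the arithmetic. The prices used are the DeepSeek V3.2 rates from the tables below, chosen purely for illustration; the 4x ratio holds at any price, since both token counts scale by the same factor.

```python
def monthly_cost(requests_per_day, input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m, days=30):
    """Estimated monthly spend from per-request token counts and $/1M prices."""
    per_request = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1_000_000
    return per_request * requests_per_day * days

# The two request shapes from the paragraph above, priced at DeepSeek V3.2
# rates ($0.26 in / $0.38 out) purely for illustration:
lean = monthly_cost(50_000, 2_000, 500, 0.26, 0.38)       # ≈ $1,065 / month
bloated = monthly_cost(50_000, 8_000, 2_000, 0.26, 0.38)  # ≈ $4,260 / month
print(f"lean ${lean:,.0f} vs bloated ${bloated:,.0f}")    # exactly 4x apart
```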

Output tokens are also consistently more expensive than input tokens across every major provider. The ratio varies: Claude 4 Sonnet charges 5x more for output than input, DeepSeek V3.2 on DeepInfra charges about 1.5x more. For workloads that generate long completions, like document drafting, code synthesis, or detailed agent outputs, the output cost dominates the bill by a wide margin and deserves more attention than most teams give it.

How Model Choice Shapes Unit Economics

Not all models cost the same to serve, and the reasons are structural. Larger, denser models require more GPU memory and more compute per token. MoE (Mixture of Experts) architectures activate only a fraction of their parameters per forward pass, which is why models like DeepSeek V3 and Kimi K2 can be large in total parameter count while remaining relatively cheap to serve.

This is worth understanding because it means model size alone is a poor proxy for cost. What matters more is the architecture and how efficiently it can be served at inference time.

Here is a practical snapshot of current DeepInfra pricing across capability tiers:

Frontier and near-frontier:

| Model | Input ($/1M) | Output ($/1M) | Architecture |
|---|---|---|---|
| Claude 4 Opus | $16.50 | $82.50 | Dense |
| Claude 4 Sonnet | $3.30 | $16.50 | Dense |
| Gemini 2.5 Pro | $1.25 | $10.00 | Dense / MoE |
| DeepSeek R1-0528 | $0.50 | $2.15 | MoE |
| Kimi K2 0905 | $0.50 | $2.00 | MoE |

Strong mid-tier:

| Model | Input ($/1M) | Output ($/1M) | Architecture |
|---|---|---|---|
| DeepSeek V3.2 | $0.26 | $0.38 | MoE |
| DeepSeek V3.1 | $0.21 | $0.79 | MoE |
| Gemini 2.5 Flash | $0.30 | $2.50 | MoE |
| Llama 4 Maverick | $0.15 | $0.60 | MoE |
| NVIDIA Nemotron 3 Super | $0.10 | $0.50 | MoE |

Budget tier:

| Model | Input ($/1M) | Output ($/1M) | Architecture |
|---|---|---|---|
| Qwen3-32B | $0.08 | $0.28 | Dense |
| Llama 4 Scout | $0.08 | $0.30 | MoE |
| Gemma 3 27B | $0.08 | $0.16 | Dense |
| Mistral Small 24B | $0.05 | $0.08 | Dense |
| Llama 3.1 8B | $0.02 | $0.05 | Dense |

Pricing from deepinfra.com/pricing as of April 2026.

The inference economics case for MoE models is strong for high-volume workloads. DeepSeek V3.2 at $0.26 input and $0.38 output delivers performance that competes with models costing several times more. For a production RAG pipeline running 100,000 daily requests with 4,000 input tokens and 800 output tokens per request, the monthly cost difference between DeepSeek V3.2 and Claude 4 Sonnet is roughly $4,000 vs. $79,000. Both are reasonable choices in different situations, but a gap that wide warrants a deliberate decision rather than a default.
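
As a worked check on those figures, assuming a 30-day month and the prices from the tables above:

```python
# Prices in $ per 1M tokens, taken from the tables above.
PRICES = {
    "DeepSeek V3.2":   (0.26, 0.38),
    "Claude 4 Sonnet": (3.30, 16.50),
}

REQUESTS_PER_DAY, INPUT_TOK, OUTPUT_TOK, DAYS = 100_000, 4_000, 800, 30

for model, (p_in, p_out) in PRICES.items():
    monthly = DAYS * REQUESTS_PER_DAY * (INPUT_TOK * p_in + OUTPUT_TOK * p_out) / 1_000_000
    print(f"{model}: ${monthly:,.0f}/month")
# DeepSeek V3.2:   $4,032/month
# Claude 4 Sonnet: $79,200/month
```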

Where Caching and Context Management Change the Picture

Two of the highest-leverage optimizations in inference economics cost nothing to set up beyond some thought about your prompt structure.

Cached input pricing is available on several models on DeepInfra, including Kimi K2, DeepSeek V3, and Claude. When the same prefix, such as a long system prompt, a shared document, or a static knowledge block, appears at the start of many requests, those tokens can be served at a significantly discounted rate. DeepSeek V3.2 cached input runs at $0.13 per million versus $0.26 standard. Claude 3.7 Sonnet cached input runs at $0.33 per million versus $3.30 standard. For any workload with a consistent system prompt or repeated context, this is not a minor optimization. It is a cost structure change.
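
One way to reason about the impact is the blended input price after caching. The sketch below assumes a 3,000-token static prefix inside a 4,000-token request and a 95% cache hit rate; those workload numbers are illustrative, and only the cached and standard rates come from the pricing above.

```python
def blended_input_price(standard, cached, prefix_tokens, total_input_tokens, hit_rate):
    """Effective $/1M input price when a shared prefix is served from cache."""
    cached_share = hit_rate * prefix_tokens / total_input_tokens
    return cached * cached_share + standard * (1 - cached_share)

# DeepSeek V3.2: $0.26 standard input, $0.13 cached input.
# Assumed workload: 3,000-token static system prompt in a 4,000-token request,
# with 95% of requests hitting the cache.
price = blended_input_price(0.26, 0.13, 3_000, 4_000, 0.95)
print(f"${price:.3f} per 1M input tokens")  # ≈ $0.167, about 36% below standard
```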

Context window management is the other lever. RAG pipelines are particularly prone to context bloat. It is common to pass three to five full documents into a prompt when only a paragraph or two is actually relevant to the query. Tightening retrieval to return shorter, higher-precision chunks rather than whole documents can cut input tokens by 40 to 60 percent with no quality regression on the output side. That saving applies on every single request.
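
A minimal sketch of that idea, assuming chunk-level retrieval with relevance scores; the 4-characters-per-token approximation and the 1,500-token budget are placeholder choices, not recommendations:

```python
# Keep only the highest-scoring retrieved chunks that fit within a fixed token
# budget, instead of concatenating whole documents into the prompt.

def rough_token_count(text: str) -> int:
    # Crude approximation (~4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)

def build_context(chunks: list[tuple[float, str]], budget_tokens: int = 1_500) -> str:
    """chunks: (relevance_score, text) pairs returned by your retriever."""
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = rough_token_count(text)
        if used + cost > budget_tokens:
            continue
        selected.append(text)
        used += cost
    return "\n\n".join(selected)
```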

A simple example: a customer support pipeline sending 6,000 input tokens per request at 80,000 daily requests on Claude 4 Sonnet costs around $48,000 per month in input tokens alone. Reducing average input to 2,500 tokens through better retrieval brings that to around $20,000. No model switch required.

What Agentic Workloads Do to Your Cost Model

Standard API cost estimates are built around single-turn or short-turn interactions. Agentic workflows break that model entirely.

A single user-initiated task in an agentic system can trigger anywhere from 5 to 20 individual LLM calls as the agent reasons through steps, calls tools, processes results, and verifies outputs. Each call carries its own context window, often including the full conversation history or a growing scratchpad. The token count per task compounds quickly.
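
Some rough arithmetic shows why. Assuming, purely for illustration, a 1,000-token base prompt, 15 calls per task, and about 400 tokens of new history appended before each subsequent call, passing the full history into every call more than doubles the input tokens compared with passing a compact state summary:

```python
# Illustrative numbers only: 1,000-token base prompt, 15 calls per task,
# ~400 tokens of new history appended before each subsequent call.
BASE, GROWTH, CALLS = 1_000, 400, 15

# Passing the full, growing history into every call:
full_history = sum(BASE + i * GROWTH for i in range(CALLS))  # 57,000 input tokens per task

# Passing only a compact ~600-token state summary into each call:
tight_state = CALLS * (BASE + 600)                            # 24,000 input tokens per task

print(full_history, tight_state)
```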

The implication is that the per-token price matters more in agentic settings than in any other context. A $0.50 per million input token model and a $3.30 per million input token model may feel similar in a single-turn setting. Across 15 calls per user task with accumulating context, that difference becomes a serious unit economics question.

There are three practical responses to this. First, be deliberate about which model handles which step. Routing planning and reasoning steps to a stronger model while delegating retrieval, formatting, and summarization to a cheaper one cuts cost without degrading the quality of outputs that actually matter. Second, keep the context window tight at each step. Passing the full conversation history into every sub-call is expensive and often unnecessary. Passing only the relevant state reduces token consumption significantly. Third, evaluate whether a reasoning model like DeepSeek R1 is actually needed for each step or whether a fast, capable non-reasoning model like DeepSeek V3.2 handles the task just as well at a fraction of the output cost.
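
A sketch of the first two ideas, assuming DeepInfra's OpenAI-compatible endpoint; the base URL, step names, and model IDs here are illustrative, so check the model pages for the exact identifiers before using them.

```python
from openai import OpenAI

# DeepInfra exposes an OpenAI-compatible API; the base URL and model IDs below
# are illustrative placeholders, not verified identifiers.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="YOUR_DEEPINFRA_API_KEY",
)

# Route each agent step to the cheapest model that handles it well.
STEP_MODELS = {
    "plan":      "deepseek-ai/DeepSeek-R1-0528",  # reasoning-heavy step
    "retrieve":  "meta-llama/Llama-4-Scout",      # cheap query/tool step
    "summarize": "deepseek-ai/DeepSeek-V3.2",     # mid-tier drafting step
}

def run_step(step: str, task_state: str, instruction: str) -> str:
    """Pass only the state this step needs, not the full conversation history."""
    response = client.chat.completions.create(
        model=STEP_MODELS[step],
        messages=[
            {"role": "system", "content": f"You are the {step} step of an agent."},
            {"role": "user", "content": f"{instruction}\n\nRelevant state:\n{task_state}"},
        ],
        max_tokens=512,  # cap output tokens per step
    )
    return response.choices[0].message.content
```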

Picking the Right Pricing Tier for Your Traffic Pattern

DeepInfra prices inference on a pay-as-you-go, per-token basis with no long-term contracts or upfront costs. That structure works well across a range of traffic patterns, but the model you choose should match both your quality requirements and your volume.

  • For low-to-moderate volume where quality is paramount, the closed source models on DeepInfra (Claude and Gemini) make sense. The per-token price is higher but the request volume is manageable, and the quality ceiling is the highest available anywhere.
  • For high-volume production workloads where cost per completion drives unit economics, the mid-tier open source models are the clear choice. DeepSeek V3.2, Kimi K2, and Qwen3-235B all sit in a range where you get near-frontier quality without the frontier price tag. At 500,000 or more daily requests, even a $0.10 per million token difference in input price adds up to thousands of dollars a month.
  • For high-throughput, low-complexity tasks like classification, extraction, short summarization, and routing decisions, the budget tier models are often more than sufficient. Llama 4 Scout at $0.08 input and $0.30 output, or Mistral Small at $0.05 and $0.08, can handle a large share of production traffic for teams willing to measure before assuming they need a larger model.

The approach most teams eventually land on is tiered routing: a cheap, fast model handles the majority of requests, a mid-tier model takes on moderate complexity, and the flagship model is reserved for tasks where the quality difference is measurable and worth paying for. With DeepInfra's consistent latency and tight TTFT variance across all tiers, adding a routing layer does not introduce meaningful latency overhead.
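
A minimal routing sketch under those assumptions; the heuristic, thresholds, and model IDs are placeholders, and in practice many teams use a small classifier model or explicit task-type labels instead of prompt length alone.

```python
TIERS = {
    "budget":   "meta-llama/Llama-4-Scout",    # illustrative model IDs;
    "mid":      "deepseek-ai/DeepSeek-V3.2",   # check the DeepInfra model
    "flagship": "anthropic/claude-4-sonnet",   # pages for exact identifiers
}

def pick_model(prompt: str, needs_deep_reasoning: bool = False) -> str:
    if needs_deep_reasoning:
        return TIERS["flagship"]
    # Cheap heuristic: short, single-line requests go to the budget tier.
    if len(prompt) < 2_000 and "\n" not in prompt:
        return TIERS["budget"]
    return TIERS["mid"]

print(pick_model("Classify this ticket as billing, bug, or feature request."))
# -> meta-llama/Llama-4-Scout (budget tier)
```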

Inference economics, at its core, is about making that routing decision deliberately rather than by default. The models available today are capable enough that paying frontier prices for every token in your pipeline is rarely the right answer.
