
Gemma 4 puts a serious open-weight reasoning model into a genuinely competitive provider market. The same Gemma 4 26B A4B model is available across seven API providers, with blended pricing ranging from $0.10 to $0.70 per 1M tokens — real variation that changes production economics. Released April 3, 2026 by Google DeepMind under Apache 2.0, it uses a Mixture-of-Experts design with 25.2B total parameters and 3.8B active per token, a 262K context window, native function calling, and multimodal input support.
For technical teams, the question is whether the combination of open licensing, long context, reasoning support, and provider-level price variation gives you a better production envelope for your workload. The tradeoffs are already in the data: Clarifai leads on throughput and time to first answer token, DeepInfra stands out on blended cost and TTFT, Cloudflare is competitive on output pricing and latency, and Google AI Studio offers direct managed access from the model’s creator.
Gemma 4 26B A4B is an open-weight Google DeepMind reasoning model with a 262K context window, Apache 2.0 licensing, and broad provider availability across Cloudflare, DeepInfra, Google AI Studio, Parasail, Novita, GMI, and Clarifai. Pricing is unusually competitive for this class of model, with blended provider pricing ranging from $0.10 to $0.70 per 1M tokens. For most developers it is best suited to cost-sensitive production workloads, long-context assistants, and multimodal pipelines where you want strong capability without stepping into premium closed-model pricing. See the open vs. closed source model comparison for a broader view of where open models like Gemma 4 fit the current landscape.
| Best For | Provider | Why |
|---|---|---|
| Lowest price / cost-sensitive workloads | DeepInfra | Ties for the lowest blended price at $0.10/1M tokens and has the lowest listed input price among the benchmarked providers at $0.07/1M input tokens. |
| Fastest time-to-first-token for interactive apps | DeepInfra | Artificial Analysis reports the lowest TTFT at 0.68s, measured to the first token of any kind rather than the first answer token, ahead of Cloudflare and Clarifai. |
| First-party or managed model access | Google AI Studio | Direct hosted access from the model’s developer, with function calling and JSON mode support. |
| RAG, document-heavy, or high-throughput use cases | Clarifai | Highest measured output speed at 153.1 t/s and lowest time to first answer token at 13.95s — strong for long responses and throughput-heavy serving. |
| Balanced price and output cost | Cloudflare | Blended price of $0.12/1M tokens with the lowest listed output price at $0.30/1M output tokens. |
| Lowest blended cost alternative to DeepInfra | Parasail | Matches DeepInfra at $0.10 blended per 1M tokens, with faster output speed: 68.6 t/s vs DeepInfra's 39.4 t/s. |
Gemma 4 pricing is token-based, so your bill follows the amount of text, images, and generated output that move through the API. If you have ever been surprised by a “cheap” model that got expensive once it started thinking out loud and returning long answers, this is the part to pay attention to. For a deeper primer on the math, see the token math and cost-per-completion guide.
A token is a chunk of text, not a word. Short words may be one token; longer words, punctuation, code, JSON, and multilingual text often split into more. For practical budgeting, 1 token ≈ 0.75 words, so 1,000 tokens ≈ 750 words, or a few paragraphs. With Gemma 4, token costs matter more for four reasons: the context window runs up to 262K tokens; reasoning (thinking) mode can increase generated output; structured outputs and tool calls add verbose JSON; and multimodal inputs such as images and documents can quietly expand token usage.
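As a rough illustration of the budgeting rule above, the sketch below estimates token counts from word counts and checks them against the 262,144-token context window. The function names and the whitespace-based word count are assumptions made for illustration; a real tokenizer will give different numbers, especially for code, JSON, and multilingual text.

```python
import math

# Rule of thumb from the text: 1 token is roughly 0.75 words.
# This is a budgeting heuristic, not how the tokenizer actually works.
WORDS_PER_TOKEN = 0.75

# Context window listed for Gemma 4 26B A4B.
CONTEXT_WINDOW = 262_144


def estimate_tokens(text: str) -> int:
    """Estimate tokens from a whitespace-separated word count (hypothetical helper)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_context(prompt_tokens: int, reserved_output: int = 0,
                 context_window: int = CONTEXT_WINDOW) -> bool:
    """Check that the prompt plus reserved output space fits in the window."""
    return prompt_tokens + reserved_output <= context_window


if __name__ == "__main__":
    sample = "word " * 750
    print(estimate_tokens(sample))                    # estimate for a 750-word text
    print(fits_context(estimate_tokens(sample), 500))
```

An estimate like this is useful for early sizing only; before committing to a budget, it is safer to count tokens from real requests.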
| Token type | What it is | Why it matters |
|---|---|---|
| Input tokens | Everything you send in the request: system prompt, user prompt, chat history, tool schemas, JSON instructions, and any serialized context | Long system prompts, big RAG payloads, and repeated chat history can dominate spend even before the model answers. |
| Output tokens | Everything the model generates back: answer text, code, JSON, tool call arguments, and reasoning content depending on implementation | Output tokens usually cost more than input tokens. Long answers, verbose JSON, and agent workflows can turn a low-cost request into an expensive one. |
| Cached input tokens | Reused prompt content billed at a lower effective rate in blended pricing models | Matters for apps with repeated system prompts, long documents, or stable context. Artificial Analysis uses a 7:2:1 cache-input-output ratio for blended comparisons. |
| Reasoning tokens | Tokens spent during the model’s internal thinking process or reasoning mode | On reasoning models, latency and cost can diverge from what the visible answer length suggests. |
| Tool / function-call tokens | Tokens used to describe tools, arguments, schemas, and tool-call outputs | Large tool schemas and verbose tool results can bloat both input and output token counts. |
| Multimodal tokens | Tokens derived from non-text inputs such as images and video frames | OCR-heavy documents, screenshots, charts, and frame-by-frame analysis can expand token usage fast. |
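To make the 7:2:1 cache-input-output ratio concrete, here is a minimal sketch that treats the blended price as a token-weighted average of the three rates. Both that reading of the methodology and the fallback of pricing cached tokens at the normal input rate are assumptions, not something the benchmark publisher confirms here. Under those assumptions the listed DeepInfra and Cloudflare rates come out at about $0.10 and $0.12 per 1M tokens after rounding.

```python
def blended_price(cached_per_m: float, input_per_m: float, output_per_m: float,
                  weights: tuple = (7, 2, 1)) -> float:
    """Token-weighted average price per 1M tokens.

    weights is the cache:input:output token mix. Treating the blended
    figure as a weighted average is an assumption for illustration.
    """
    w_cache, w_input, w_output = weights
    total_weight = w_cache + w_input + w_output
    weighted_sum = (w_cache * cached_per_m
                    + w_input * input_per_m
                    + w_output * output_per_m)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Assumption: no cache discount, so cached tokens use the input rate.
    deepinfra = blended_price(cached_per_m=0.07, input_per_m=0.07, output_per_m=0.34)
    cloudflare = blended_price(cached_per_m=0.10, input_per_m=0.10, output_per_m=0.30)
    print(round(deepinfra, 2), round(cloudflare, 2))
```

If your own traffic has a different mix, for example little cache reuse and long outputs, pass different weights; the blended figure can shift noticeably.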
Where Gemma 4 token costs help — and where they bite
Provider token-cost tradeoffs
| Provider | Input /1M | Output /1M | Blended /1M | Advantage | Drawback |
|---|---|---|---|---|---|
| DeepInfra | $0.07 | $0.34 | $0.10 | Lowest listed input cost among the seven benchmarked providers; tied for lowest blended; strong default for long prompts and RAG | Output is not the cheapest: long generated responses cost more than on Cloudflare |
| Cloudflare | $0.10 | $0.30 | $0.12 | Lowest listed output cost; good for verbose assistants, coding, and generation-heavy apps | Higher input cost than DeepInfra — large prompts and long chat history add up faster |
| Parasail | $0.13 | $0.40 | $0.10 | Ties for lowest blended price in the benchmark methodology | Direct input and output rates are both worse than DeepInfra and Cloudflare — real cost depends heavily on workload shape |
| OpenRouter | $0.06 | $0.33 | — | Very competitive listed rates; useful single integration path | Effective cost can vary with routing behavior and underlying provider choice |
| Clarifai | — | — | $0.70 | May still be justified when speed is more valuable than token price | Highest blended cost in the benchmark by a wide margin |
| Google AI Studio | — | — | — | Direct hosted access from the model creator | The benchmark data cited here lists no token pricing, so cost cannot be compared directly |
| Novita | — | — | $0.16 | Mid-pack blended price | Not among the cheapest options in the benchmark |
| GMI (FP8) | — | — | $0.16 | Moderate blended cost | Slower than leading providers |
Practical budgeting rules for Gemma 4
DeepInfra runs on bare-metal infrastructure — cutting out layers of cloud virtualization helps reduce overhead and keep both latency and serving costs tighter. That is a big reason bare-metal-first providers are often able to undercut major cloud platforms, and DeepInfra is typically 50–80% cheaper than those larger-cloud alternatives. If you are a developer, a high-volume API user, or a team watching every token dollar, this is the sort of provider worth shortlisting first.
| Model | Best Use Case | Context Window | Input ($/1M) | Output ($/1M) |
|---|---|---|---|---|
| Gemma 4 26B A4B | Cost-efficient reasoning, long-context assistants, multimodal API workloads | 262,144 tokens | $0.07 | $0.34 |
On DeepInfra, Gemma 4 26B A4B is priced at $0.07/1M input tokens and $0.34/1M output tokens — one of the cheapest ways in this dataset to run a serious long-context reasoning model, especially for prompt-heavy workloads where input pricing does most of the damage. Teams stepping up to a larger member of the family can also evaluate the Gemma 4 31B for production deployments that need additional capability headroom.
The scenarios below reflect Gemma 4 workloads where DeepInfra is a strong fit: input-heavy, sensitive to TTFT, or simply cost-sensitive at production scale.
Scenario 1: RAG support bot with long document context
Each request pulls in product docs, policy snippets, or internal knowledge before generating a short answer. This is exactly the kind of workload where DeepInfra’s $0.07/1M input tokens helps, because prompt volume usually dominates cost.
| Metric | Value |
|---|---|
| Volume | 5M requests/month |
| Model | Gemma 4 26B A4B |
| Provider | DeepInfra |
| Input Tokens | 10,000 per request |
| Output Tokens | 500 per request |
| Monthly Cost | $4,350 |
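Cost breakdown: input is 5M × 10,000 = 50B tokens at $0.07/1M, or $3,500; output is 5M × 500 = 2.5B tokens at $0.34/1M, or $850; total $4,350 per month.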
Low input pricing plus the lowest reported TTFT at 0.68s makes DeepInfra a strong option for document-heavy assistants that need to feel responsive before the full answer arrives.
Comparison: The same workload on Parasail would cost $7,500/month — $3,150 more.
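The arithmetic behind these scenario figures can be sketched in a few lines. The `Rates` class and `monthly_cost` function are hypothetical names introduced for illustration; the rates themselves are the direct input and output prices listed in the provider table above, and the workload is Scenario 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    input_per_m: float    # USD per 1M input tokens
    output_per_m: float   # USD per 1M output tokens


# Direct rates as listed in the provider table.
DEEPINFRA = Rates(input_per_m=0.07, output_per_m=0.34)
PARASAIL = Rates(input_per_m=0.13, output_per_m=0.40)


def monthly_cost(requests: int, input_tokens: int, output_tokens: int,
                 rates: Rates) -> tuple:
    """Return (input_cost, output_cost, total) in USD, rounded to cents."""
    input_cost = requests * input_tokens / 1_000_000 * rates.input_per_m
    output_cost = requests * output_tokens / 1_000_000 * rates.output_per_m
    return (round(input_cost, 2), round(output_cost, 2),
            round(input_cost + output_cost, 2))


if __name__ == "__main__":
    # Scenario 1: 5M requests/month, 10,000 input and 500 output tokens each.
    for name, rates in (("DeepInfra", DEEPINFRA), ("Parasail", PARASAIL)):
        print(name, monthly_cost(5_000_000, 10_000, 500, rates))
```

The same function covers the other scenarios by swapping in their volumes and per-request token counts. It ignores cache discounts and reasoning tokens, so treat the result as a floor rather than a forecast.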
Scenario 2: Interactive coding copilot with large prompt state
A lot of source context goes in, but the reply is relatively compact. DeepInfra is attractive for the same reason as RAG: cheap input tokens and fast initial token latency.
| Metric | Value |
|---|---|
| Volume | 20M requests/month |
| Model | Gemma 4 26B A4B |
| Provider | DeepInfra |
| Input Tokens | 2,000 per request |
| Output Tokens | 300 per request |
| Monthly Cost | $4,840 |
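Cost breakdown: input is 20M × 2,000 = 40B tokens at $0.07/1M, or $2,800; output is 20M × 300 = 6B tokens at $0.34/1M, or $2,040; total $4,840 per month.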
Coding copilots are often gated by prompt size, not just answer size. DeepInfra’s input pricing keeps large file context, system instructions, and tool schemas from becoming the main billing problem. For teams that want a smaller and cheaper option to prototype against first, Gemma 3 4B is a useful baseline before scaling up to Gemma 4.
Comparison: The same workload on Cloudflare would cost $5,800/month — $960 more.
Scenario 3: High-volume JSON extraction pipeline
This pipeline turns invoices, forms, screenshots, or semi-structured documents into JSON. Gemma 4 supports structured output and function calling, and DeepInfra combines that with low input pricing that helps when every request includes extraction instructions plus raw document text.
| Metric | Value |
|---|---|
| Volume | 50M requests/month |
| Model | Gemma 4 26B A4B |
| Provider | DeepInfra |
| Input Tokens | 1,500 per request |
| Output Tokens | 150 per request |
| Monthly Cost | $7,800 |
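Cost breakdown: input is 50M × 1,500 = 75B tokens at $0.07/1M, or $5,250; output is 50M × 150 = 7.5B tokens at $0.34/1M, or $2,550; total $7,800 per month.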
Repetitive, production-scale extraction jobs are where a few cents per million tokens becomes real money. DeepInfra is also tied for the lowest blended price at $0.10/1M, which matters when prompt structures are reused heavily.
Comparison: The same workload on Parasail would cost $12,750/month — $4,950 more.
Scenario 4: Multimodal document assistant for screenshots and PDFs
Gemma 4 on DeepInfra supports text and image input, useful for support dashboards, OCR-adjacent document workflows, and UI understanding tasks. Teams evaluating image and video support can browse the full multimodal model catalog to see how Gemma 4 stacks up against other vision-capable open models.
| Metric | Value |
|---|---|
| Volume | 2M requests/month |
| Model | Gemma 4 26B A4B |
| Provider | DeepInfra |
| Input Tokens | 8,000 per request |
| Output Tokens | 400 per request |
| Monthly Cost | $1,392 |
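Cost breakdown: input is 2M × 8,000 = 16B tokens at $0.07/1M, or $1,120; output is 2M × 400 = 800M tokens at $0.34/1M, or $272; total $1,392 per month.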
Multimodal pipelines often become input-heavy fast, especially when extracted text, OCR content, and long instructions are bundled together — playing directly into DeepInfra’s strongest pricing advantage.
Comparison: The same workload on Cloudflare would cost $1,840/month — $448 more.
Scenario 5: Cached-context internal assistant
This scenario is an internal assistant with a stable system prompt, repeated policy context, and reused task framing. Blended pricing matters more than raw headline rates here. Artificial Analysis puts DeepInfra at $0.10/1M tokens blended, tied for the lowest in the benchmark.
| Metric | Value |
|---|---|
| Volume | 100M effective tokens/month (7:2:1 cache-input-output mix) |
| Model | Gemma 4 26B A4B |
| Provider | DeepInfra |
| Monthly Cost | $10 |
This is the best case for DeepInfra — reused prompt material, repeated workflows, and lots of internal traffic where the 7:2:1 benchmark methodology is a decent approximation.
Comparison: The same 100M-token workload on Clarifai at $0.70/1M blended would cost $70/month — 7x more.
The pattern is clear: if your Gemma 4 app is prompt-heavy, context-heavy, multimodal, or built around repeated prompt structure, DeepInfra is one of the easiest providers to justify on cost. It is not the cheapest on output tokens, so it is not always the best choice for extremely verbose generation. For the kinds of workloads most production teams actually run — RAG, extraction, internal copilots, and structured assistants — it is a very strong default. Teams running across many open-weight models can also explore the broader model directory to see which other reasoning and chat models share the same pricing structure.
Choosing a provider for Gemma 4 26B A4B is about matching your workload shape to the provider whose pricing structure rewards it. Input-heavy apps pay differently than output-heavy ones. Cached-context assistants look different in the billing data than multimodal extraction pipelines. The model is the same across providers; what changes is which economics align with how your app actually generates tokens.
For most developers the practical decision comes down to three things: where your token volume lands (input versus output), whether your prompt structure repeats enough to benefit from blended pricing, and how much TTFT affects your user experience. If you are building something prompt-heavy — RAG, document pipelines, structured extraction — DeepInfra’s $0.07 input pricing is a real advantage that compounds at scale. If your app generates long responses and output volume dominates, Cloudflare’s $0.30 output rate deserves a closer look. For a broader view of how these token economics compare across the open-weight model landscape, see the open vs. closed source model guide.
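The input-versus-output question can be turned into a single number. The sketch below, using hypothetical helper names, solves for the output-to-input token ratio at which DeepInfra ($0.07 in, $0.34 out) and Cloudflare ($0.10 in, $0.30 out) cost the same per request. It uses only the direct listed rates and ignores caching, latency, and reasoning tokens, so it is a first-pass screen rather than a full comparison.

```python
def breakeven_output_ratio(a_in: float, a_out: float,
                           b_in: float, b_out: float) -> float:
    """Output/input token ratio where provider A and provider B cost the same.

    Assumes A has the cheaper input rate and B the cheaper output rate.
    Below the returned ratio A is cheaper; above it B is cheaper.
    Derived from: a_in*I + a_out*O == b_in*I + b_out*O.
    """
    if not (a_in < b_in and a_out > b_out):
        raise ValueError("expected A cheaper on input and B cheaper on output")
    return (b_in - a_in) / (a_out - b_out)


def cheaper_provider(input_tokens: int, output_tokens: int) -> str:
    """Compare per-request cost on the listed DeepInfra and Cloudflare rates."""
    deepinfra = 0.07 * input_tokens + 0.34 * output_tokens
    cloudflare = 0.10 * input_tokens + 0.30 * output_tokens
    return "DeepInfra" if deepinfra <= cloudflare else "Cloudflare"


if __name__ == "__main__":
    ratio = breakeven_output_ratio(0.07, 0.34, 0.10, 0.30)
    print(f"break-even output/input ratio: {ratio:.2f}")
    print(cheaper_provider(10_000, 500))   # prompt-heavy RAG shape
    print(cheaper_provider(1_000, 2_000))  # generation-heavy shape
```

On these rates the break-even sits at roughly 0.75 output tokens per input token, so a workload has to generate nearly as much as it sends before Cloudflare's output rate wins on direct pricing alone.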
One thing worth keeping in mind: Gemma 4’s 262K context window and reasoning support are genuinely useful capabilities, but they also create new ways to spend tokens unintentionally. Test with realistic traffic before you commit to a provider at scale. If you want to start hands-on, the Gemma 4 26B A4B demo on DeepInfra is a fast way to get a feel for the model. The pricing is transparent, the API is OpenAI-compatible, and the cost floor is low enough that there is no good reason not to run your own numbers.
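One way to run your own numbers is to price each response from its reported token usage. The sketch below assumes the response follows the OpenAI chat completions shape, with `prompt_tokens` and `completion_tokens` under a `usage` object; whether a given provider counts reasoning tokens inside `completion_tokens` is something to verify. The sample responses are hand-written placeholders, not real API output, and the helper names are hypothetical.

```python
# DeepInfra's listed rates for Gemma 4 26B A4B, in USD per 1M tokens.
INPUT_PER_M = 0.07
OUTPUT_PER_M = 0.34


def cost_from_usage(usage: dict, input_per_m: float = INPUT_PER_M,
                    output_per_m: float = OUTPUT_PER_M) -> float:
    """Price one response from an OpenAI-style usage dict (assumed shape)."""
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    return prompt / 1_000_000 * input_per_m + completion / 1_000_000 * output_per_m


def total_cost(responses: list) -> float:
    """Sum cost over parsed response dicts, skipping any without usage data."""
    return sum(cost_from_usage(r["usage"]) for r in responses if "usage" in r)


if __name__ == "__main__":
    # Placeholder responses for illustration only.
    sample = [
        {"usage": {"prompt_tokens": 10_000, "completion_tokens": 500}},
        {"usage": {"prompt_tokens": 8_000, "completion_tokens": 400}},
    ]
    print(f"${total_cost(sample):.6f}")
```

Logging this per request during a trial with realistic traffic shows quickly whether long contexts or reasoning output are inflating spend beyond what the scenario math suggests.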
© 2026 DeepInfra. All rights reserved.