DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

As of May 2026, seven API providers offer access to Gemma 4 26B A4B, and the spread in performance and cost is wide enough to matter in production. Blended pricing ranges from $0.00 (Google AI Studio free tier) to $0.70 per 1M tokens, TTFT spans 0.68s to 5.51s, and output speed varies by nearly 5x between the fastest and slowest providers. This breakdown covers all seven — benchmarks, pricing, and which provider fits which workload.
Quick Reference: Best Provider by Use Case
Gemma 4 26B A4B — Best APIs
| Provider | Why it’s a best pick | Blended ($/1M) | Input ($/1M) | Output ($/1M) | TTFT (s) | Speed (t/s) | Context |
|---|---|---|---|---|---|---|---|
| DeepInfra | Best value + best latency | 0.10 | 0.07 | 0.34 | 0.68 | 39 | 262k |
| Parasail | Best low-cost, higher throughput | 0.10 | 0.13 | 0.40 | 1.22 | 69 | 256k |
| Cloudflare | Best balanced throughput/latency; lowest output price | 0.12 | 0.10 | 0.30 | 0.84 | 68 | 256k |
| Clarifai | Absolute fastest output speed (premium-priced) | 0.70 | — | — | 0.88 | 153 | 256k |
Gemma 4 26B A4B is Google DeepMind’s Mixture-of-Experts model from the Gemma 4 family, released April 3, 2026 under the Apache 2.0 license. The “A” in 26B A4B stands for “active parameters”: while the model contains 25.2 billion total parameters, it activates only 3.8 billion per token during generation, allowing it to run at roughly the speed of a 4B dense model. All 26 billion parameters must be loaded into memory to support fast routing. The model is available on DeepInfra and the 31B dense variant is also available for workloads that need additional capability headroom.
The model uses a hybrid attention mechanism interleaving local sliding window attention with full global attention, supporting up to 256K tokens context. Key capabilities: built-in reasoning mode (<|think|> token), native system prompt support, function calling, JSON mode, and multimodal input (text + image). All models cover 140+ languages.
| Provider | TTFT (s) | Speed (t/s) | Context | Blended ($/1M) | Key Feature |
|---|---|---|---|---|---|
| DeepInfra | 0.68s | 39.4 | 262k | $0.10 | Lowest latency |
| Clarifai | 0.88s | 153.2 | 256k | $0.70 | Highest throughput |
| Cloudflare | 0.84s | 68.3 | 256k | $0.12 | Balanced; lowest output price |
| Parasail | 1.22s | 68.6 | 256k | $0.10 | High-speed budget |
| Google AI Studio | 1.63s | 46.7 | 262k | $0.00 | Free tier prototyping |
| Novita | 1.79s | 36.0 | 262k | $0.16 | Mid-tier generalist |
| GMI (FP8) | 5.51s | 29.0 | 1M | $0.16 | Maximum context |
1. DeepInfra — Best Overall & Lowest Latency
DeepInfra delivers the lowest TTFT in the benchmark at 0.68s — critical for interactive applications where time to first token defines perceived responsiveness. Combined with the lowest input price in the set ($0.07/1M) and a tied-lowest blended price ($0.10/1M), it is the most technically balanced option for prompt-intensive production workloads. The 262k context window matches the model’s maximum supported length. For a detailed cost breakdown by workload type, see the Gemma 4 pricing guide.
2. Clarifai — Highest Throughput
Clarifai leads all 7 providers on output speed at 153.2 t/s — 289% faster than the slowest provider in the set. For batch processing, bulk generation, or any workload where sustained output speed is the primary constraint, it is the clear choice. The trade-off is price: at $0.70/1M blended, it is up to 7x more expensive than the baseline market rate.
3. Cloudflare — Balanced with Lowest Output Price
Cloudflare hits a strong balance across speed, latency, and cost, and is the right choice for generation-heavy workloads. Its output token price of $0.30/1M is the lowest in the benchmark — directly relevant for coding assistants, report generation, or any app where the model talks a lot. Input pricing ($0.10/1M) is slightly higher than DeepInfra, so it is less attractive for prompt-dominated workloads.
4. Parasail — High-Speed Budget Alternative
Parasail matches DeepInfra’s tied-lowest blended price while posting significantly faster output speed (68.6 t/s vs 39.4 t/s). The trade-off is TTFT: at 1.22s, it is nearly double DeepInfra’s 0.68s — a meaningful difference for interactive applications. For asynchronous or batch workloads where TTFT is not a constraint, Parasail is the strongest low-cost alternative.
5. Google AI Studio — Free Tier Prototyping
Google AI Studio provides first-party access to Gemma 4 26B A4B at no cost, making it the natural starting point for development and evaluation. Its 1.63s TTFT and mid-range output speed (46.7 t/s) are adequate for prototyping but not competitive for live, user-facing applications. Useful for developers who want to validate model behavior before committing to a production provider.
6. Novita — Mid-Tier Generalist
Novita supports the full 262k context window but does not stand out on any performance metric. Its TTFT (1.79s) and throughput (36.0 t/s) are below the benchmark median, and its $0.16/1M blended price is above both DeepInfra and Cloudflare. It is best used as a fallback option in intelligent routing setups rather than a primary provider.
7. GMI (FP8) — Maximum Context Specialist
GMI is the outlier in this benchmark. Its 5.51s TTFT and 29.0 t/s output speed are the worst in the set, but it is the only provider offering a 1M token context window for Gemma 4 26B A4B. For massive document processing tasks — full codebase ingestion, large case files, or extreme long-context RAG — that capability may justify the latency trade-off. It is not suitable for interactive use cases.
Across the 7 benchmarked providers, DeepInfra is the recommended API for production Gemma 4 26B A4B deployment. It leads on TTFT (0.68s), has the lowest input price ($0.07/1M), ties for the lowest blended price ($0.10/1M), and supports the model’s full 262k context window alongside JSON mode and function calling. For workloads requiring maximum output throughput, Clarifai is the right choice at a significant price premium. For output-heavy workloads where response verbosity dominates cost, Cloudflare’s $0.30/1M output rate deserves a closer look. For extreme long-context tasks, GMI is the only provider offering 1M token context.
For teams evaluating Gemma 4 against other open-weight models on the same infrastructure, the full text generation model catalog and the open vs. closed source model guide are useful reference points. To get started, visit the Gemma 4 26B A4B model page on DeepInfra.
FLUX.1-dev Guide: Mastering Text-to-Image AI Prompts for Stunning and Consistent VisualsLearn how to craft compelling prompts for FLUX.1-dev to create stunning images.
NVIDIA Nemotron 3 Nano 30B API Benchmarks: Latency & Cost<p>About NVIDIA Nemotron 3 Nano 30B A3B NVIDIA Nemotron 3 Nano 30B A3B is a large language model trained from scratch by NVIDIA, designed as a unified model for both reasoning and non-reasoning tasks. It is part of the Nemotron 3 family — NVIDIA’s most efficient family of open models, built for agentic AI applications. […]</p>
Gemma 4 Pricing, Benchmarks & Real-World Cost Analysis<p>Gemma 4 puts a serious open-weight reasoning model into a genuinely competitive provider market. The same Gemma 4 26B A4B model is available across seven API providers, with blended pricing ranging from $0.10 to $0.70 per 1M tokens — real variation that changes production economics. Released April 3, 2026 by Google DeepMind under Apache 2.0, […]</p>
© 2026 DeepInfra. All rights reserved.