Gemma 4 is Google DeepMind’s latest family of open-weight models, released on April 3, 2026 under the Apache 2.0 license. The family spans four model sizes — from edge-optimized variants for mobile devices to a 31B dense model for server-side deployments — with every model supporting multimodal input, built-in reasoning, and a context window of up to 256K tokens. The 26B A4B Mixture-of-Experts variant and the 31B dense model are both available on DeepInfra.
Model Variants
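Gemma 4 is available in four sizes, each targeting a different deployment context: the edge-optimized E2B and E4B variants for mobile devices, the 26B A4B Mixture-of-Experts model, and the 31B dense model for server-side deployments.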
Attention and Context
All Gemma 4 models use a hybrid attention mechanism that interleaves local sliding window attention with full global attention. Global attention layers apply Proportional RoPE (p-RoPE) to keep memory overhead manageable for long-context tasks. The 26B A4B and 31B models support a 256K token context window; E2B and E4B support 128K.
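The difference between the two layer types comes down to which earlier tokens each position may attend to. The sketch below is illustrative only: the window size, the layer count, and the one-global-layer-in-three pattern are hypothetical values chosen for readability, not Gemma 4's actual configuration, and p-RoPE is not modeled.

```python
def causal_mask(seq_len, window=None):
    """Build a seq_len x seq_len mask; True means query i may attend to key j.

    window=None gives full (global) causal attention.
    window=w restricts each query to its w most recent tokens (sliding window).
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i  # causal: never look at future tokens
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


def layer_kinds(num_layers, global_every):
    """Interleave layer types: every `global_every`-th layer is global (hypothetical pattern)."""
    return [
        "global" if (layer + 1) % global_every == 0 else "local"
        for layer in range(num_layers)
    ]


kinds = layer_kinds(num_layers=6, global_every=3)
local_mask = causal_mask(seq_len=5, window=2)
global_mask = causal_mask(seq_len=5)

print(kinds)
print(local_mask[4])   # last token in a local layer
print(global_mask[4])  # last token in a global layer
```

In a local layer the last token sees only its two most recent positions, while in a global layer it sees the whole prefix. Local layers therefore need to keep only a window's worth of keys and values, which is why mixing the two helps at long context lengths.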
Reasoning and Multimodality
All models include a built-in reasoning engine that allows step-by-step processing before generating a final response, triggered via the <|think|> token or the enable_thinking parameter. Native system prompt support gives developers control over model behavior and conversational structure.
Gemma 4 processes interleaved text and images with variable aspect ratio and resolution support across all model sizes. Video analysis (up to 60 seconds at 1fps) and native audio processing are supported on E2B and E4B variants. The 26B A4B model handles text and image inputs. All models cover 140+ languages.
The model family is compatible with JAX, Keras, Unsloth, and standard transformers fine-tuning frameworks.
Benchmark results below are for the instruction-tuned 26B A4B variant with thinking enabled, unless noted.
26B A4B — Standard Benchmarks
| Benchmark | Category | Score |
|---|---|---|
| MMLU Pro | General Knowledge | 82.6% |
| GPQA Diamond | Science Reasoning | 82.3% |
| AIME 2026 (no tools) | Mathematics | 88.3% |
| LiveCodeBench v6 | Coding | 77.1% |
| MMMLU | Multilingual | 86.3% |
| MMMU Pro | Multimodal (Vision) | 73.8% |
Generational Comparison
| Benchmark | Gemma 4 31B (Dense) | Gemma 4 26B A4B (MoE) | Gemma 3 27B |
|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 67.6% |
| AIME 2026 | 89.2% | 88.3% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 29.1% |
| MMMU Pro | 76.9% | 73.8% | 49.7% |
The jump from Gemma 3 27B to the 26B A4B model is most pronounced in math (20.8% → 88.3% on AIME 2026), coding (29.1% → 77.1% on LiveCodeBench v6), and agentic tool use. The dense 31B model scores higher on every benchmark in the table, but by narrow margins of roughly 1 to 3 points. Additional results: Codeforces ELO 1718; OmniDocBench 1.5 document parsing at 0.149 average edit distance.
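The margins can be recomputed directly from the generational comparison table. The snippet below is a small arithmetic check using only the scores listed there; the dictionary layout and function name are illustrative.

```python
# Scores copied from the generational comparison table:
# (Gemma 4 31B dense, Gemma 4 26B A4B MoE, Gemma 3 27B), in percent.
SCORES = {
    "MMLU Pro": (85.2, 82.6, 67.6),
    "AIME 2026": (89.2, 88.3, 20.8),
    "LiveCodeBench v6": (80.0, 77.1, 29.1),
    "MMMU Pro": (76.9, 73.8, 49.7),
}


def score_gaps(scores):
    """Return per-benchmark gaps in percentage points, rounded to one decimal."""
    gaps = {}
    for name, (dense, moe, gemma3) in scores.items():
        gaps[name] = {
            "dense_minus_moe": round(dense - moe, 1),
            "moe_minus_gemma3": round(moe - gemma3, 1),
        }
    return gaps


gaps = score_gaps(SCORES)
for name, gap in gaps.items():
    print(f"{name}: 31B leads 26B A4B by {gap['dense_minus_moe']} pts; "
          f"26B A4B leads Gemma 3 27B by {gap['moe_minus_gemma3']} pts")
```

On these four benchmarks the dense model's lead over the MoE variant ranges from 0.9 to 3.1 points, while the MoE variant's lead over Gemma 3 27B ranges from 15.0 to 67.5 points.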
Gemma 4 26B A4B is available on DeepInfra via an OpenAI-compatible API — no infrastructure setup required. Swap in your DeepInfra token and the model identifier, and your existing OpenAI client code works as-is.
cURL:

    curl "https://api.deepinfra.com/v1/openai/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
      -d '{
        "model": "google/gemma-4-26B-A4B-it",
        "messages": [
          {"role": "user", "content": "Explain the MoE architecture in plain terms."}
        ]
      }'

Python:

    import os

    from openai import OpenAI

    # Read the token from the environment instead of hard-coding it.
    client = OpenAI(
        api_key=os.environ["DEEPINFRA_TOKEN"],
        base_url="https://api.deepinfra.com/v1/openai",
    )
    response = client.chat.completions.create(
        model="google/gemma-4-26B-A4B-it",
        messages=[{"role": "user", "content": "Explain the MoE architecture in plain terms."}],
    )
    print(response.choices[0].message.content)

JavaScript:

    import OpenAI from "openai";

    // Read the token from the environment instead of hard-coding it.
    const openai = new OpenAI({
      apiKey: process.env.DEEPINFRA_TOKEN,
      baseURL: "https://api.deepinfra.com/v1/openai",
    });
    const response = await openai.chat.completions.create({
      model: "google/gemma-4-26B-A4B-it",
      messages: [{ role: "user", content: "Explain the MoE architecture in plain terms." }],
    });
    console.log(response.choices[0].message.content);

To enable thinking mode, include the <|think|> token in your system prompt. For multimodal inputs, place image content before text in the message payload. Recommended sampling defaults: temperature=1.0, top_p=0.95, top_k=64.
The full API reference is available at deepinfra.com/google/gemma-4-26B-A4B-it/api. You can also try the model directly in the browser at the interactive demo.
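The thinking-mode and multimodal guidance above can be folded into a small helper that assembles the `messages` list. This is a minimal sketch: `build_messages` is a hypothetical helper, placing the `<|think|>` token at the start of the system prompt is an assumption (the guidance only says to include it), and the image part uses the standard OpenAI chat content-part shape, which is assumed rather than confirmed to be accepted by this endpoint.

```python
THINK_TOKEN = "<|think|>"

# Recommended sampling defaults from the text. top_k is not a standard
# OpenAI chat parameter, so it is left out here.
SAMPLING_DEFAULTS = {"temperature": 1.0, "top_p": 0.95}


def build_messages(prompt, image_url=None, thinking=False, system_prompt=""):
    """Assemble an OpenAI-style messages list (hypothetical helper).

    - thinking=True puts the <|think|> token into the system prompt.
    - When an image is given, the image part is placed before the text part.
    """
    messages = []

    system_text = system_prompt
    if thinking:
        # Assumption: the token goes at the start of the system prompt.
        system_text = f"{THINK_TOKEN}\n{system_prompt}".strip()
    if system_text:
        messages.append({"role": "system", "content": system_text})

    if image_url is None:
        user_content = prompt
    else:
        user_content = [
            {"type": "image_url", "image_url": {"url": image_url}},  # image first
            {"type": "text", "text": prompt},
        ]
    messages.append({"role": "user", "content": user_content})
    return messages


example = build_messages(
    "What is shown in this chart?",
    image_url="https://example.com/chart.png",  # placeholder URL
    thinking=True,
)
print(example)
```

The resulting list can be passed as `messages=` to `client.chat.completions.create(...)`, together with `**SAMPLING_DEFAULTS`.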
Key Parameters
| Parameter | Description |
|---|---|
| max_new_tokens | Limits the length of the generated output. |
| temperature | Controls response randomness (recommended: 1.0 for thinking mode). |
| stream | If true, sends partial deltas as server-sent events. |
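When `stream` is enabled, the response arrives as a sequence of partial deltas rather than a single message. The sketch below shows one way to reassemble them, assuming chunks shaped like those yielded by the OpenAI Python client (`chunk.choices[0].delta.content`). The `fake_chunk` helper is a stand-in so the example runs without network access.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the text deltas from an iterable of streamed chat chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices
        text = getattr(chunk.choices[0].delta, "content", None)
        if text:  # skip empty or None deltas
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk, used only for this offline demo."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


demo_chunks = [fake_chunk("Hel"), fake_chunk(None), fake_chunk("lo")]
print(collect_stream(demo_chunks))  # prints "Hello"
```

With the real client, passing `stream=True` to `client.chat.completions.create(...)` returns an iterator that can be handed to `collect_stream`. In an interactive application each delta would usually be printed as it arrives instead of being buffered.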
Gemma 4 on DeepInfra uses usage-based pricing calculated per 1 million tokens:
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $0.07 |
| Output Tokens | $0.34 |
For current rates, tier-based discounts, and model comparisons, visit the DeepInfra pricing page.
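For budgeting, the per-million rates in the table translate into a one-line calculation. The sketch below hard-codes the listed prices, which may change, and the token counts in the example are made up.

```python
# Prices from the table above, in USD per 1 million tokens.
INPUT_PRICE_PER_M = 0.07
OUTPUT_PRICE_PER_M = 0.34


def estimate_cost(input_tokens, output_tokens):
    """Estimate the USD cost of a workload from its token counts."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Example workload: 2M input tokens and 500K output tokens.
cost = estimate_cost(2_000_000, 500_000)
print(f"${cost:.2f}")  # prints "$0.31"
```

Output tokens cost roughly five times as much as input tokens at these rates, and thinking-mode responses tend to be longer, so reasoning-heavy workloads are likely to be dominated by output cost.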
Gemma 4 delivers a meaningful generational step in reasoning, multimodal capability, and context length — all under Apache 2.0 licensing with no restrictions on commercial use or modification. The 26B A4B MoE variant is the most practical choice for most production workloads: near-flagship benchmark performance at inference speeds closer to a 4B dense model, at $0.07/1M input and $0.34/1M output tokens on DeepInfra.
To get started, visit deepinfra.com/google/gemma-4-26B-A4B-it to try the demo or access the API. For teams evaluating the broader model landscape, the model catalog and the Gemma 4 pricing guide provide more detail.