Gemma 4 is Google DeepMind’s latest family of open-weight models, released on April 3, 2026 under the Apache 2.0 license. The family spans four model sizes — from edge-optimized variants for mobile devices to a 31B dense model for server-side deployments — with every model supporting multimodal input, built-in reasoning, and a context window of up to 256K tokens. The 26B A4B Mixture-of-Experts variant and the 31B dense model are both available on DeepInfra.
Model Variants
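Gemma 4 is available in four sizes, each targeting a different deployment context: E2B and E4B for edge and mobile devices, the 26B A4B Mixture-of-Experts model, and the 31B dense model for server-side deployments.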
Attention and Context
All Gemma 4 models use a hybrid attention mechanism that interleaves local sliding window attention with full global attention. Global attention layers apply Proportional RoPE (p-RoPE) to keep memory overhead manageable for long-context tasks. The 26B A4B and 31B models support a 256K token context window; E2B and E4B support 128K.
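To make the local/global distinction concrete, the sketch below builds the two kinds of causal attention mask in plain Python. It is an illustration only: the window size, the number of layers, and the "every Nth layer is global" pattern are hypothetical values chosen for the example, not Gemma 4's published configuration, and p-RoPE is not modeled.

```python
def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len boolean mask.

    mask[q][k] is True when query position q may attend to key position k.
    With window=None the mask is full (global) causal attention; with an
    integer window, each query sees only the most recent `window` positions.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: never look at future tokens
            if window is not None:
                visible = visible and (q - k) < window  # sliding window
            row.append(visible)
        mask.append(row)
    return mask


def layer_masks(num_layers, seq_len, window, global_every):
    """Interleave local and global layers.

    Hypothetical pattern: every `global_every`-th layer is global,
    all other layers use the sliding window.
    """
    masks = []
    for layer in range(num_layers):
        is_global = (layer + 1) % global_every == 0
        masks.append(causal_mask(seq_len, None if is_global else window))
    return masks


if __name__ == "__main__":
    local = causal_mask(6, window=3)
    full = causal_mask(6)
    # The last query sees 3 keys under the sliding window and all 6 globally.
    print(sum(local[5]), sum(full[5]))
```

The local mask keeps per-token attention cost bounded by the window size regardless of sequence length, which is why only the global layers grow with context.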
Reasoning and Multimodality
All models include a built-in reasoning engine that allows step-by-step processing before generating a final response, triggered via the <|think|> token or the enable_thinking parameter. Native system prompt support gives developers control over model behavior and conversational structure.
Gemma 4 processes interleaved text and images with variable aspect ratio and resolution support across all model sizes. Video analysis (up to 60 seconds at 1fps) and native audio processing are supported on E2B and E4B variants. The 26B A4B model handles text and image inputs. All models cover 140+ languages.
The model family is compatible with JAX, Keras, Unsloth, and standard transformers fine-tuning frameworks.
Benchmark results below are for the instruction-tuned 26B A4B variant with thinking enabled, unless noted.
26B A4B — Standard Benchmarks
| Benchmark | Category | Score |
|---|---|---|
| MMLU Pro | General Knowledge | 82.6% |
| GPQA Diamond | Science Reasoning | 82.3% |
| AIME 2026 (no tools) | Mathematics | 88.3% |
| LiveCodeBench v6 | Coding | 77.1% |
| MMMLU | Multilingual | 86.3% |
| MMMU Pro | Multimodal (Vision) | 73.8% |
Generational Comparison
| Benchmark | Gemma 4 31B (Dense) | Gemma 4 26B A4B (MoE) | Gemma 3 27B |
|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 67.6% |
| AIME 2026 | 89.2% | 88.3% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 29.1% |
| MMMU Pro | 76.9% | 73.8% | 49.7% |
The jump from Gemma 3 27B is most pronounced in math (20.8% → 88.3% on AIME 2026), coding (29.1% → 77.1% on LiveCodeBench v6), and agentic tool use. The dense 31B model leads the 26B A4B on every benchmark in the table, but by narrow margins of roughly 1 to 3 points. Additional results: Codeforces ELO 1718; OmniDocBench 1.5 document parsing at 0.149 average edit distance.
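The gaps can be read straight off the Generational Comparison table. The short sketch below recomputes them from the published scores; the dictionary layout and function name are illustrative choices, and the numbers are copied from the table without modification.

```python
# Scores from the Generational Comparison table:
# (Gemma 4 31B dense, Gemma 4 26B A4B MoE, Gemma 3 27B), in percent.
SCORES = {
    "MMLU Pro": (85.2, 82.6, 67.6),
    "AIME 2026": (89.2, 88.3, 20.8),
    "LiveCodeBench v6": (80.0, 77.1, 29.1),
    "MMMU Pro": (76.9, 73.8, 49.7),
}


def score_gaps(scores):
    """Return {benchmark: (dense_minus_moe, moe_minus_gemma3)} in points."""
    gaps = {}
    for name, (dense, moe, gemma3) in scores.items():
        # Round to one decimal to hide floating-point noise.
        gaps[name] = (round(dense - moe, 1), round(moe - gemma3, 1))
    return gaps


if __name__ == "__main__":
    for name, (vs_dense, vs_gemma3) in score_gaps(SCORES).items():
        print(f"{name}: 31B leads 26B A4B by {vs_dense} pts; "
              f"26B A4B gains {vs_gemma3} pts over Gemma 3 27B")
```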
Gemma 4 26B A4B is available on DeepInfra via an OpenAI-compatible API — no infrastructure setup required. Swap in your DeepInfra token and the model identifier, and your existing OpenAI client code works as-is.
cURL:

    curl "https://api.deepinfra.com/v1/openai/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
      -d '{
        "model": "google/gemma-4-26B-A4B-it",
        "messages": [
          {"role": "user", "content": "Explain the MoE architecture in plain terms."}
        ]
      }'

Python:

    import os

    from openai import OpenAI

    # Read the token from the DEEPINFRA_TOKEN environment variable
    # rather than hard-coding it in the source.
    client = OpenAI(
        api_key=os.environ["DEEPINFRA_TOKEN"],
        base_url="https://api.deepinfra.com/v1/openai",
    )
    response = client.chat.completions.create(
        model="google/gemma-4-26B-A4B-it",
        messages=[{"role": "user", "content": "Explain the MoE architecture in plain terms."}],
    )
    print(response.choices[0].message.content)

JavaScript:

    import OpenAI from "openai";

    // Read the token from the DEEPINFRA_TOKEN environment variable.
    const openai = new OpenAI({
      apiKey: process.env.DEEPINFRA_TOKEN,
      baseURL: "https://api.deepinfra.com/v1/openai",
    });
    const response = await openai.chat.completions.create({
      model: "google/gemma-4-26B-A4B-it",
      messages: [{ role: "user", content: "Explain the MoE architecture in plain terms." }],
    });
    console.log(response.choices[0].message.content);

To enable thinking mode, include the <|think|> token in your system prompt. For multimodal inputs, place image content before text in the message payload. Recommended sampling defaults: temperature=1.0, top_p=0.95, top_k=64.
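The three recommendations above (thinking token in the system prompt, image before text, sampling defaults) can be combined into one request body. The sketch below only builds the JSON payload and sends nothing. Several details are assumptions rather than confirmed API behavior: that a system prompt consisting solely of `<|think|>` is sufficient, that images use the OpenAI-style `image_url` content part, and that the endpoint accepts `top_k` as a top-level field. The helper name and the example image URL are hypothetical.

```python
import json

MODEL_ID = "google/gemma-4-26B-A4B-it"


def build_payload(question, image_url=None, thinking=True):
    """Build a chat-completions request body (hypothetical helper)."""
    messages = []
    if thinking:
        # Assumption: the bare token in the system prompt enables thinking.
        messages.append({"role": "system", "content": "<|think|>"})

    if image_url is None:
        user_content = question
    else:
        # Image part first, text part second, as recommended above.
        # Assumption: OpenAI-style content parts are accepted.
        user_content = [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": question},
        ]
    messages.append({"role": "user", "content": user_content})

    return {
        "model": MODEL_ID,
        "messages": messages,
        # Recommended sampling defaults from the text.
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # assumption: accepted as a top-level field
    }


if __name__ == "__main__":
    payload = build_payload(
        "What does this chart show?",
        image_url="https://example.com/chart.png",  # placeholder URL
    )
    print(json.dumps(payload, indent=2))
```

With the `openai` Python client, non-standard fields such as `top_k` are normally passed through the `extra_body` argument rather than as direct keyword arguments.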
The full API reference is available at deepinfra.com/google/gemma-4-26B-A4B-it/api. You can also try the model directly in the browser at the interactive demo.
Key Parameters
| Parameter | Description |
|---|---|
| max_new_tokens | Limits the length of the generated output. |
| temperature | Controls response randomness (recommended: 1.0 for thinking mode). |
| stream | If true, sends partial deltas as server-sent events. |
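When `stream` is true, the client has to reassemble the partial deltas itself. The sketch below shows one way to do that on raw server-sent-event lines. It assumes the stream follows the OpenAI chat-completion chunk convention (`data: {json}` lines carrying `choices[].delta.content`, terminated by `data: [DONE]`); the sample lines are hand-written for illustration, not captured API output.

```python
import json


def collect_stream_text(lines):
    """Concatenate the text deltas from an iterable of SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        'data: {"choices": [{"delta": {"role": "assistant"}}]}',
        "",
        'data: {"choices": [{"delta": {"content": "Hel"}}]}',
        'data: {"choices": [{"delta": {"content": "lo"}}]}',
        "data: [DONE]",
    ]
    print(collect_stream_text(sample))  # prints: Hello
```

In practice the `openai` client libraries do this parsing for you when `stream=True` is set; the manual version is mainly useful with plain HTTP clients.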
Gemma 4 on DeepInfra uses usage-based pricing calculated per 1 million tokens:
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $0.07 |
| Output Tokens | $0.34 |
For current rates, tier-based discounts, and model comparisons, visit the DeepInfra pricing page.
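As a quick way to turn the per-million rates into a per-request figure, the sketch below multiplies token counts by the prices in the table. The function name and the example token counts are illustrative, and the constants should be updated if the listed rates change.

```python
# Prices from the table above, in USD per 1 million tokens.
INPUT_PRICE_PER_M = 0.07
OUTPUT_PRICE_PER_M = 0.34


def completion_cost(input_tokens, output_tokens):
    """Return the USD cost of one request at the listed rates."""
    total = (input_tokens * INPUT_PRICE_PER_M
             + output_tokens * OUTPUT_PRICE_PER_M)
    return total / 1_000_000


if __name__ == "__main__":
    # Example: a 2,000-token prompt with a 500-token reply.
    cost = completion_cost(2_000, 500)
    print(f"${cost:.6f} per request")          # $0.000310 per request
    print(f"${cost * 1_000_000:.2f} per 1M requests")
```

Because output tokens cost roughly five times as much as input tokens at these rates, long thinking-mode responses tend to dominate the bill even when prompts are large.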
Gemma 4 delivers a meaningful generational step in reasoning, multimodal capability, and context length — all under Apache 2.0 licensing with no restrictions on commercial use or modification. The 26B A4B MoE variant is the most practical choice for most production workloads: near-flagship benchmark performance at inference speeds closer to a 4B dense model, at $0.07/1M input and $0.34/1M output tokens on DeepInfra.
To get started, visit deepinfra.com/google/gemma-4-26B-A4B-it to try the demo or access the API. For teams evaluating the broader model landscape, the full model catalog and the Gemma 4 pricing guide cover the full picture.
© 2026 DeepInfra. All rights reserved.