
Gemma 4 is Google DeepMind’s latest family of open-weight models, released on April 3, 2026 under the Apache 2.0 license. The family spans four model sizes — from edge-optimized variants for mobile devices to a 31B dense model for server-side deployments — with every model supporting multimodal input, built-in reasoning, and a context window of up to 256K tokens. The 26B A4B Mixture-of-Experts variant and the 31B dense model are both available on DeepInfra.
Model Variants
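Gemma 4 is available in four sizes, each targeting a different deployment context: E2B and E4B for edge and mobile devices, the 26B A4B Mixture-of-Experts model, and the 31B dense model for server-side deployments.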
Attention and Context
All Gemma 4 models use a hybrid attention mechanism that interleaves local sliding window attention with full global attention. Global attention layers apply Proportional RoPE (p-RoPE) to keep memory overhead manageable for long-context tasks. The 26B A4B and 31B models support a 256K token context window; E2B and E4B support 128K.
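To make the difference between the two layer types concrete, here is a minimal sketch in plain Python of the causal masks involved. The 8-token sequence and the window of 4 are illustrative values chosen for readability, not Gemma 4's actual configuration, and the function names are hypothetical.

```python
def global_causal_mask(seq_len):
    """Full causal attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Local causal attention: token i attends to the last `window` tokens, itself included."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask):
    """Count allowed (query, key) pairs, a rough proxy for attention cost."""
    return sum(sum(row) for row in mask)


SEQ_LEN = 8  # illustrative only
WINDOW = 4   # illustrative only, not Gemma 4's real window size

global_mask = global_causal_mask(SEQ_LEN)
local_mask = sliding_window_mask(SEQ_LEN, WINDOW)

print(attended_pairs(global_mask))  # grows quadratically with sequence length
print(attended_pairs(local_mask))   # grows linearly once seq_len exceeds the window
```

Interleaving the two keeps most layers cheap while the global layers still let distant tokens interact, which is the trade-off the hybrid design relies on.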
Reasoning and Multimodality
All models include a built-in reasoning engine that allows step-by-step processing before generating a final response, triggered via the <|think|> token or the enable_thinking parameter. Native system prompt support gives developers control over model behavior and conversational structure.
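As a rough illustration of the token-based trigger, the helper below builds an OpenAI-style message list and adds `<|think|>` to the system prompt when thinking is requested. The helper name and the choice to put the token at the very start of the system prompt are assumptions made for this sketch.

```python
THINK_TOKEN = "<|think|>"


def build_messages(user_prompt, system_prompt="", thinking=False):
    """Return an OpenAI-style message list, optionally enabling thinking mode."""
    system_text = system_prompt.strip()
    if thinking:
        # Assumption: the token is placed at the start of the system prompt.
        system_text = f"{THINK_TOKEN} {system_text}".strip()

    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_prompt})
    return messages


for message in build_messages("Is 221 prime?", "Be brief.", thinking=True):
    print(message)
```

Leaving `thinking=False` and passing no system prompt produces a plain single-message conversation, so the same helper covers both modes.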
Gemma 4 processes interleaved text and images with variable aspect ratio and resolution support across all model sizes. Video analysis (up to 60 seconds at 1fps) and native audio processing are supported on E2B and E4B variants. The 26B A4B model handles text and image inputs. All models cover 140+ languages.
The model family is compatible with JAX, Keras, Unsloth, and standard transformers fine-tuning frameworks.
Benchmark results below are for the instruction-tuned 26B A4B variant with thinking enabled, unless noted.
26B A4B — Standard Benchmarks
| Benchmark | Category | Score |
|---|---|---|
| MMLU Pro | General Knowledge | 82.6% |
| GPQA Diamond | Science Reasoning | 82.3% |
| AIME 2026 (no tools) | Mathematics | 88.3% |
| LiveCodeBench v6 | Coding | 77.1% |
| MMMLU | Multilingual | 86.3% |
| MMMU Pro | Multimodal (Vision) | 73.8% |
Generational Comparison
| Benchmark | Gemma 4 31B (Dense) | Gemma 4 26B A4B (MoE) | Gemma 3 27B |
|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 67.6% |
| AIME 2026 | 89.2% | 88.3% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 29.1% |
| MMMU Pro | 76.9% | 73.8% | 49.7% |
The jump from Gemma 3 27B is most pronounced in math (20.8% → 88.3% on AIME 2026), coding (29.1% → 77.1% on LiveCodeBench v6), and agentic tool use. The dense 31B model scores higher on every benchmark in the table, but by narrow margins of roughly 1 to 3 points. Additional results: Codeforces ELO 1718; OmniDocBench 1.5 document parsing at 0.149 average edit distance.
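The gaps in the comparison table can be recomputed directly. The short script below uses only the scores listed above; the variable names are just for illustration.

```python
# (Gemma 4 31B dense, Gemma 4 26B A4B, Gemma 3 27B), in percent
scores = {
    "MMLU Pro": (85.2, 82.6, 67.6),
    "AIME 2026": (89.2, 88.3, 20.8),
    "LiveCodeBench v6": (80.0, 77.1, 29.1),
    "MMMU Pro": (76.9, 73.8, 49.7),
}

dense_lead = {}        # how far the 31B dense model is ahead of the 26B A4B
gain_over_gemma3 = {}  # how far the 26B A4B is ahead of Gemma 3 27B

for name, (dense, moe, gemma3) in scores.items():
    dense_lead[name] = round(dense - moe, 1)
    gain_over_gemma3[name] = round(moe - gemma3, 1)
    print(f"{name}: 31B lead {dense_lead[name]} pts, gain over Gemma 3 {gain_over_gemma3[name]} pts")
```

The largest generational gain is on AIME 2026 (67.5 points), while the dense model's lead over the MoE variant stays between 0.9 and 3.1 points.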
Gemma 4 26B A4B is available on DeepInfra via an OpenAI-compatible API — no infrastructure setup required. Swap in your DeepInfra token and the model identifier, and your existing OpenAI client code works as-is.
cURL:

    curl "https://api.deepinfra.com/v1/openai/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
      -d '{
        "model": "google/gemma-4-26B-A4B-it",
        "messages": [
          {"role": "user", "content": "Explain the MoE architecture in plain terms."}
        ]
      }'

Python:

    import os

    from openai import OpenAI

    # Read the token from the environment instead of hard-coding it.
    client = OpenAI(
        api_key=os.environ["DEEPINFRA_TOKEN"],
        base_url="https://api.deepinfra.com/v1/openai",
    )

    response = client.chat.completions.create(
        model="google/gemma-4-26B-A4B-it",
        messages=[{"role": "user", "content": "Explain the MoE architecture in plain terms."}],
    )
    print(response.choices[0].message.content)

JavaScript:

    import OpenAI from "openai";

    // Read the token from the environment instead of hard-coding it.
    const openai = new OpenAI({
      apiKey: process.env.DEEPINFRA_TOKEN,
      baseURL: "https://api.deepinfra.com/v1/openai",
    });

    const response = await openai.chat.completions.create({
      model: "google/gemma-4-26B-A4B-it",
      messages: [{ role: "user", content: "Explain the MoE architecture in plain terms." }],
    });
    console.log(response.choices[0].message.content);

To enable thinking mode, include the <|think|> token in your system prompt. For multimodal inputs, place image content before text in the message payload. Recommended sampling defaults: temperature=1.0, top_p=0.95, top_k=64.
The full API reference is available at deepinfra.com/google/gemma-4-26B-A4B-it/api. You can also try the model directly in the browser at the interactive demo.
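The image-before-text ordering and the recommended sampling defaults can be combined into one request body. The sketch below only builds and prints the JSON payload. It uses the OpenAI-style `image_url` content part, and it assumes the endpoint accepts that format for this model; the image URL is a placeholder and the helper name is hypothetical.

```python
import json


def build_vision_request(image_url, question, model="google/gemma-4-26B-A4B-it"):
    """Build a chat-completions payload with the image part placed before the text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "temperature": 1.0,
        "top_p": 0.95,
        # Assumption: top_k is not part of the standard OpenAI schema, so
        # check the API reference to confirm the endpoint honors it.
        "top_k": 64,
    }


payload = build_vision_request(
    "https://example.com/chart.png",  # placeholder image URL
    "What trend does this chart show?",
)
print(json.dumps(payload, indent=2))
```

With the `openai` Python client, non-standard fields such as `top_k` are normally passed through the `extra_body` argument rather than as named parameters.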
Key Parameters
| Parameter | Description |
|---|---|
| max_new_tokens | Limits the length of the generated output. |
| temperature | Controls response randomness (recommended: 1.0 for thinking mode). |
| stream | If true, sends partial deltas as server-sent events. |
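When `stream` is true, the client has to reassemble the answer from partial deltas. The sketch below parses server-sent event lines in the chunk format used by the OpenAI API, which an OpenAI-compatible endpoint would be expected to mirror. The sample events are hand-written for illustration, not captured from DeepInfra.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events, for illustration only.
sample_events = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Mixture"}}]}',
    'data: {"choices": [{"delta": {"content": " of Experts"}}]}',
    "data: [DONE]",
]

answer = "".join(iter_stream_text(sample_events))
print(answer)
```

In practice the official client libraries handle this parsing when `stream=True` is passed, so a hand-rolled parser is mainly useful with raw HTTP clients.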
Gemma 4 26B A4B on DeepInfra is billed by usage, with prices quoted per 1 million tokens:
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $0.07 |
| Output Tokens | $0.34 |
For current rates, tier-based discounts, and model comparisons, visit the DeepInfra pricing page.
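As a quick worked example of the per-token arithmetic, the snippet below estimates the bill for a workload at the listed rates. The token counts are hypothetical, and the rates are hard-coded from the table above, so they should be checked against the pricing page before use.

```python
INPUT_PRICE_PER_M = 0.07   # USD per 1M input tokens, from the table above
OUTPUT_PRICE_PER_M = 0.34  # USD per 1M output tokens, from the table above


def estimate_cost(input_tokens, output_tokens):
    """Estimate the cost in USD for a given number of input and output tokens."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical workload: 2M input tokens and 500K output tokens.
cost = estimate_cost(2_000_000, 500_000)
print(f"${cost:.2f}")
```

Output tokens cost almost five times as much as input tokens at these rates, so long generations affect the bill more than long prompts do.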
Gemma 4 delivers a meaningful generational step in reasoning, multimodal capability, and context length — all under Apache 2.0 licensing with no restrictions on commercial use or modification. The 26B A4B MoE variant is the most practical choice for most production workloads: near-flagship benchmark performance at inference speeds closer to a 4B dense model, at $0.07/1M input and $0.34/1M output tokens on DeepInfra.
To get started, visit deepinfra.com/google/gemma-4-26B-A4B-it to try the demo or access the API. For teams evaluating the broader model landscape, the full model catalog and the Gemma 4 pricing guide cover the full picture.