Gemma 4 Model Overview: Features, Architecture & Use Cases
Published on 2026.05.25 by DeepInfra

Gemma 4 is Google DeepMind’s latest family of open-weight models, released on April 3, 2026 under the Apache 2.0 license. The family spans four model sizes — from edge-optimized variants for mobile devices to a 31B dense model for server-side deployments — with every model supporting multimodal input, built-in reasoning, and a context window of up to 256K tokens. The 26B A4B Mixture-of-Experts variant and the 31B dense model are both available on DeepInfra.

Architecture

Model Variants

Gemma 4 is available in four sizes, each targeting a different deployment context:

  • E2B and E4B: Edge-optimized models using Per-Layer Embeddings (PLE) to maximize efficiency on laptops and mobile devices. Both support a 128K context window and native audio processing for ASR and translation.
  • 26B A4B (MoE): A Mixture-of-Experts model with 25.2B total parameters, 3.8B active during inference. Runs at roughly the speed of a 4B dense model while drawing on the full 25.2B parameter knowledge base. Supports 256K context, text and image input.
  • 31B Dense: The flagship model for maximum reasoning depth and complex server-side tasks. Supports 256K context.
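
To make the "total versus active parameters" distinction concrete, here is a minimal sketch of generic top-k expert routing. It is not Gemma 4's actual router: the four toy experts, the router logits, and k=2 are all hypothetical values chosen for illustration. Only the 25.2B and 3.8B figures come from the description above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.0]  # hypothetical router scores for one token

selected = route_top_k(router_logits, k=2)
output = moe_forward(10.0, experts, router_logits, k=2)
print("selected experts:", [i for i, _ in selected])
print("mixed output:", round(output, 4))

# Share of parameters active per token, using the figures quoted above.
active_fraction = 3.8 / 25.2
print(f"active share: {active_fraction:.1%}")
```

Only two of the four toy experts run for this token, which is the reason an MoE model can cost roughly as much per token as a much smaller dense model while storing far more parameters.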

Attention and Context

All Gemma 4 models use a hybrid attention mechanism that interleaves local sliding window attention with full global attention. Global attention layers apply Proportional RoPE (p-RoPE) to keep memory overhead manageable for long-context tasks. The 26B A4B and 31B models support a 256K token context window; E2B and E4B support 128K.
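
The difference between the two attention types can be sketched with boolean masks. This is an illustrative sketch only: the window size, the layer count, and the "one global layer every third layer" pattern are hypothetical, since the article does not state Gemma 4's actual values.

```python
def causal_mask(seq_len):
    """Global attention: each token may attend to itself and all earlier tokens."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def sliding_window_mask(seq_len, window):
    """Local attention: each token attends only to the last `window` tokens."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

def layer_pattern(num_layers, global_every):
    """Interleave layers so that every `global_every`-th layer is global."""
    return ["global" if (layer + 1) % global_every == 0 else "local"
            for layer in range(num_layers)]

def attended_positions(mask):
    """Count the (query, key) pairs a mask allows."""
    return sum(sum(row) for row in mask)

seq_len, window = 8, 3  # hypothetical sizes for illustration
global_pairs = attended_positions(causal_mask(seq_len))
local_pairs = attended_positions(sliding_window_mask(seq_len, window))
print("global pairs:", global_pairs)
print("local pairs:", local_pairs)
print(layer_pattern(num_layers=6, global_every=3))
```

The local mask allows far fewer query-key pairs than the global one as the sequence grows, which is why interleaving the two keeps long-context memory use manageable.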

Reasoning and Multimodality

All models include a built-in reasoning engine that allows step-by-step processing before generating a final response, triggered via the <|think|> token or the enable_thinking parameter. Native system prompt support gives developers control over model behavior and conversational structure.

Gemma 4 processes interleaved text and images with variable aspect ratio and resolution support across all model sizes. Video analysis (up to 60 seconds at 1fps) and native audio processing are supported on E2B and E4B variants. The 26B A4B model handles text and image inputs. All models cover 140+ languages.

The model family is compatible with JAX, Keras, Unsloth, and standard transformers fine-tuning frameworks.

Performance and Benchmarks

Benchmark results below are for the instruction-tuned 26B A4B variant with thinking enabled, unless noted.

26B A4B — Standard Benchmarks

Benchmark | Category | Score
MMLU Pro | General Knowledge | 82.6%
GPQA Diamond | Science Reasoning | 82.3%
AIME 2026 (no tools) | Mathematics | 88.3%
LiveCodeBench v6 | Coding | 77.1%
MMMLU | Multilingual | 86.3%
MMMU Pro | Multimodal (Vision) | 73.8%

Generational Comparison

Benchmark | Gemma 4 31B (Dense) | Gemma 4 26B A4B (MoE) | Gemma 3 27B
MMLU Pro | 85.2% | 82.6% | 67.6%
AIME 2026 | 89.2% | 88.3% | 20.8%
LiveCodeBench v6 | 80.0% | 77.1% | 29.1%
MMMU Pro | 76.9% | 73.8% | 49.7%

The jump from Gemma 3 27B is most pronounced in math (20.8% to 88.3% on AIME 2026), coding (29.1% to 77.1% on LiveCodeBench v6), and agentic tool use (not shown in the table above). The dense 31B model leads the 26B A4B on every benchmark listed, but by narrow margins of roughly 1 to 3 points. Additional results: Codeforces ELO 1718; OmniDocBench 1.5 document parsing at 0.149 average edit distance.
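
The point differences behind this comparison can be recomputed directly from the generational table. The snippet below is plain arithmetic on the published scores; the variable names are illustrative only.

```python
# Scores copied from the generational comparison table (percent).
scores = {
    "MMLU Pro":         {"31B": 85.2, "26B A4B": 82.6, "Gemma 3 27B": 67.6},
    "AIME 2026":        {"31B": 89.2, "26B A4B": 88.3, "Gemma 3 27B": 20.8},
    "LiveCodeBench v6": {"31B": 80.0, "26B A4B": 77.1, "Gemma 3 27B": 29.1},
    "MMMU Pro":         {"31B": 76.9, "26B A4B": 73.8, "Gemma 3 27B": 49.7},
}

# Gain of the 26B A4B over Gemma 3 27B, in percentage points.
gains = {name: round(row["26B A4B"] - row["Gemma 3 27B"], 1)
         for name, row in scores.items()}

# Lead of the dense 31B over the 26B A4B, in percentage points.
dense_lead = {name: round(row["31B"] - row["26B A4B"], 1)
              for name, row in scores.items()}

for name in scores:
    print(f"{name}: +{gains[name]} vs Gemma 3 27B, 31B ahead by {dense_lead[name]}")
```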

Getting Started on DeepInfra

Gemma 4 26B A4B is available on DeepInfra via an OpenAI-compatible API — no infrastructure setup required. Swap in your DeepInfra token and the model identifier, and your existing OpenAI client code works as-is.

  • Base URL: https://api.deepinfra.com/v1/openai
  • Model identifier: google/gemma-4-26B-A4B-it
  • Authentication: Authorization: Bearer $DEEPINFRA_TOKEN
  • Supports: JSON mode, function calling, streaming, multimodal input (text + image)
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "google/gemma-4-26B-A4B-it",
      "messages": [
        {"role": "user", "content": "Explain the MoE architecture in plain terms."}
      ]
    }'
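
Python:

import os

from openai import OpenAI

# Read the token from the DEEPINFRA_TOKEN environment variable;
# Python does not expand "$DEEPINFRA_TOKEN" inside a string literal.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)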


response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Explain the MoE architecture in plain terms."}],
)
print(response.choices[0].message.content)
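
JavaScript (Node.js, ES module):

import OpenAI from "openai";

// Read the token from the DEEPINFRA_TOKEN environment variable;
// JavaScript does not expand "$DEEPINFRA_TOKEN" inside a string literal.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});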


const response = await openai.chat.completions.create({
  model: "google/gemma-4-26B-A4B-it",
  messages: [{ role: "user", content: "Explain the MoE architecture in plain terms." }],
});
console.log(response.choices[0].message.content);

To enable thinking mode, include the <|think|> token in your system prompt. For multimodal inputs, place image content before text in the message payload. Recommended sampling defaults: temperature=1.0, top_p=0.95, top_k=64.
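
As a rough sketch of how these three recommendations might combine in a single request body, the helper below builds an OpenAI-style chat payload without sending it. Several details are assumptions rather than confirmed behavior: the helper name, the placement of <|think|> at the start of the system prompt, the example system prompt and image URL, and the idea that the endpoint accepts top_k as a top-level field. Check the API reference before relying on any of them.

```python
import json

def build_thinking_request(question, image_url=None):
    """Build an OpenAI-style chat payload with thinking mode requested.

    Hypothetical helper for illustration; it only constructs a dict.
    """
    user_content = []
    if image_url is not None:
        # Image part goes before the text part, per the guidance above.
        user_content.append({"type": "image_url", "image_url": {"url": image_url}})
    user_content.append({"type": "text", "text": question})

    return {
        "model": "google/gemma-4-26B-A4B-it",
        "messages": [
            # Assumption: the token is placed at the start of the system prompt.
            {"role": "system",
             "content": "<|think|>You are a concise technical assistant."},
            {"role": "user", "content": user_content},
        ],
        # Recommended sampling defaults from the text above.
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,  # assumption: accepted as a top-level field by the endpoint
    }

payload = build_thinking_request(
    "What does this chart show?",
    image_url="https://example.com/chart.png",  # placeholder URL
)
print(json.dumps(payload, indent=2))
```

With the openai Python client, non-standard fields such as top_k are normally passed through the extra_body argument rather than as named parameters.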

The full API reference is available at deepinfra.com/google/gemma-4-26B-A4B-it/api. You can also try the model directly in the browser at the interactive demo.

Key Parameters

Parameter | Description
max_new_tokens | Limits the length of the generated output.
temperature | Controls response randomness (recommended: 1.0 for thinking mode).
stream | If true, sends partial deltas as server-sent events.
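
To illustrate what stream does in practice, here is a minimal sketch that consumes partial deltas and assembles them into the full reply. The function name stream_reply is hypothetical. It expects a client object shaped like the OpenAI Python client configured earlier in this article, and it assumes the endpoint emits chunks in the standard OpenAI streaming format.

```python
def stream_reply(client, prompt, model="google/gemma-4-26B-A4B-it"):
    """Print partial deltas as they arrive and return the assembled text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,  # ask the server for incremental chunks
    )
    parts = []
    for chunk in stream:
        # Some chunks carry no choices (for example, a trailing usage chunk).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # role-only or final chunks have no text content
            parts.append(delta)
            print(delta, end="", flush=True)
    print()
    return "".join(parts)

# Usage, assuming `client` is the OpenAI client pointed at DeepInfra:
# text = stream_reply(client, "Explain the MoE architecture in plain terms.")
```

Passing the client in as an argument keeps the function easy to exercise with a stub object, so the accumulation logic can be checked without a network call.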

Pricing

Gemma 4 on DeepInfra uses usage-based pricing calculated per 1 million tokens:

Token Type | Price per 1M Tokens
Input Tokens | $0.07
Output Tokens | $0.34

For current rates, tier-based discounts, and model comparisons, visit the DeepInfra pricing page.
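
For a quick budget estimate, the per-million rates above translate into a one-line formula. The sketch below uses the listed prices as constants; the function name and the example token counts are illustrative, and actual billing may differ if rates or discounts change.

```python
INPUT_PRICE_PER_M = 0.07   # USD per 1M input tokens, from the table above
OUTPUT_PRICE_PER_M = 0.34  # USD per 1M output tokens, from the table above

def estimate_cost(input_tokens, output_tokens):
    """Estimate the USD cost of a workload at the listed per-token rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example workload: 2M input tokens and 500K output tokens.
example_cost = estimate_cost(2_000_000, 500_000)
print(f"${example_cost:.2f}")  # prints $0.31
```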

Conclusion

Gemma 4 delivers a meaningful generational step in reasoning, multimodal capability, and context length — all under Apache 2.0 licensing with no restrictions on commercial use or modification. The 26B A4B MoE variant is the most practical choice for most production workloads: near-flagship benchmark performance at inference speeds closer to a 4B dense model, at $0.07/1M input and $0.34/1M output tokens on DeepInfra.

To get started, visit deepinfra.com/google/gemma-4-26B-A4B-it to try the demo or access the API. For teams evaluating the broader model landscape, the full model catalog and the Gemma 4 pricing guide cover the full picture.
