Gemma 4 Model Overview: Features, Architecture & Use Cases

Published on 2026.05.25 by DeepInfra

Gemma 4 is Google DeepMind’s latest family of open-weight models, released on April 3, 2026 under the Apache 2.0 license. The family spans four model sizes — from edge-optimized variants for mobile devices to a 31B dense model for server-side deployments — with every model supporting multimodal input, built-in reasoning, and a context window of up to 256K tokens. The 26B A4B Mixture-of-Experts variant and the 31B dense model are both available on DeepInfra.

Architecture

Model Variants

Gemma 4 is available in four sizes, each targeting a different deployment context:

E2B and E4B: Edge-optimized models using Per-Layer Embeddings (PLE) to maximize efficiency on laptops and mobile devices. Support 128K context and native audio processing for ASR and translation.
26B A4B (MoE): A Mixture-of-Experts model with 25.2B total parameters, 3.8B active during inference. Runs at roughly the speed of a 4B dense model while drawing on the full 25.2B parameter knowledge base. Supports 256K context, text and image input.
31B Dense: The flagship model for maximum reasoning depth and complex server-side tasks. Supports 256K context.
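
As a quick check on the MoE figures above, the share of parameters active per token can be worked out directly. This is only arithmetic on the numbers quoted in the list; the function and constant names are illustrative and do not come from any Gemma tooling.

```python
# Parameter counts quoted above for the 26B A4B MoE variant, in billions.
TOTAL_PARAMS_B = 25.2
ACTIVE_PARAMS_B = 3.8


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the fraction of parameters used for each token."""
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active per token: {fraction:.1%} of total parameters")  # prints 15.1%
```

Roughly 15% of the weights take part in any single forward pass, which is why per-token compute resembles a 4B dense model even though all 25.2B parameters still have to be held in memory.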

Attention and Context

All Gemma 4 models use a hybrid attention mechanism that interleaves local sliding window attention with full global attention. Global attention layers apply Proportional RoPE (p-RoPE) to keep memory overhead manageable for long-context tasks. The 26B A4B and 31B models support a 256K token context window; E2B and E4B support 128K.
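
The difference between local and global layers can be illustrated with attention masks. The sketch below assumes causal attention, a hypothetical window of 4 tokens, and a hypothetical pattern of one global layer after every two local layers; the article does not state Gemma 4's actual window size or interleaving ratio, and p-RoPE is not modeled here.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """Build a causal attention mask.

    mask[q][k] is True when query position q may attend to key position k.
    With window=None the layer is global (every earlier position is visible);
    otherwise only the most recent `window` positions are visible.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


def layer_kinds(num_layers: int, global_every: int) -> List[str]:
    """Label each layer; every `global_every`-th layer is global (assumed pattern)."""
    return [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(num_layers)
    ]


def visible_pairs(mask: List[List[bool]]) -> int:
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


print(layer_kinds(6, 3))
print(visible_pairs(causal_mask(8)))            # global layer: 36 pairs
print(visible_pairs(causal_mask(8, window=4)))  # local layer: 26 pairs
```

The number of visible pairs in a local layer grows linearly with sequence length, while a global layer grows quadratically, which is the motivation for keeping most layers local at long context.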

Reasoning and Multimodality

All models include a built-in reasoning engine that allows step-by-step processing before generating a final response, triggered via the <|think|> token or the enable_thinking parameter. Native system prompt support gives developers control over model behavior and conversational structure.
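
A minimal sketch of how the <|think|> trigger might be combined with a system prompt when building a chat request. The helper name `build_messages` is hypothetical, and placing the token at the start of the system prompt is an assumption; the article only says the token should be present.

```python
from typing import Dict, List

THINK_TOKEN = "<|think|>"


def build_messages(user_prompt: str, system_prompt: str = "",
                   thinking: bool = False) -> List[Dict[str, str]]:
    """Assemble chat messages, optionally adding the thinking trigger.

    Assumption: the token is prepended to the system prompt on its own line.
    """
    if thinking:
        system_prompt = f"{THINK_TOKEN}\n{system_prompt}".strip()
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


for message in build_messages("Is 221 prime?", "Answer concisely.", thinking=True):
    print(message)
```

With `thinking=False` and no system prompt, the helper returns only the user message, so the same code path serves both modes.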

Gemma 4 processes interleaved text and images with variable aspect ratio and resolution support across all model sizes. Video analysis (up to 60 seconds at 1fps) and native audio processing are supported on E2B and E4B variants. The 26B A4B model handles text and image inputs. All models cover 140+ languages.
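
To make the video limit concrete, the sketch below picks which frames of a clip would be kept when sampling at 1 fps and capping at 60 seconds. The helper `sample_frame_indices` is hypothetical; the article does not describe the actual preprocessing pipeline, so treat this as one plausible reading of "up to 60 seconds at 1fps".

```python
from typing import List


def sample_frame_indices(total_frames: int, source_fps: float,
                         target_fps: float = 1.0,
                         max_seconds: float = 60.0) -> List[int]:
    """Return indices of the frames to keep from a clip.

    Frames are taken at `target_fps`, and anything past `max_seconds`
    is dropped.
    """
    duration = min(total_frames / source_fps, max_seconds)
    num_samples = int(duration * target_fps)
    step = source_fps / target_fps
    return [int(i * step) for i in range(num_samples)]


# A 90-second clip recorded at 30 fps is truncated to its first 60 seconds.
indices = sample_frame_indices(total_frames=2700, source_fps=30.0)
print(len(indices), indices[:3], indices[-1])  # 60 [0, 30, 60] 1770
```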

The model family is compatible with JAX, Keras, Unsloth, and standard transformers fine-tuning frameworks.

Performance and Benchmarks

Benchmark results below are for the instruction-tuned 26B A4B variant with thinking enabled, unless noted.

26B A4B — Standard Benchmarks

Benchmark	Category	Score
MMLU Pro	General Knowledge	82.6%
GPQA Diamond	Science Reasoning	82.3%
AIME 2026 (no tools)	Mathematics	88.3%
LiveCodeBench v6	Coding	77.1%
MMMLU	Multilingual	86.3%
MMMU Pro	Multimodal (Vision)	73.8%

Generational Comparison

Benchmark	Gemma 4 31B (Dense)	Gemma 4 26B A4B (MoE)	Gemma 3 27B
MMLU Pro	85.2%	82.6%	67.6%
AIME 2026	89.2%	88.3%	20.8%
LiveCodeBench v6	80.0%	77.1%	29.1%
MMMU Pro	76.9%	73.8%	49.7%

The jump from Gemma 3 27B to the 26B A4B model is most pronounced in math (20.8% → 88.3% on AIME 2026) and coding (29.1% → 77.1% on LiveCodeBench v6); agentic tool use also improves, though the tables above do not report figures for it. The dense 31B model scores higher on every benchmark listed, but by narrow margins of roughly 1 to 3 points. Additional results: Codeforces ELO 1718; OmniDocBench 1.5 document parsing at 0.149 average edit distance.
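
The gaps discussed above can be recomputed from the comparison table. The snippet uses only the scores listed there; the dictionary layout and the function name are illustrative.

```python
# Scores (%) copied from the generational comparison table.
SCORES = {
    "MMLU Pro":         {"31B": 85.2, "26B A4B": 82.6, "Gemma 3 27B": 67.6},
    "AIME 2026":        {"31B": 89.2, "26B A4B": 88.3, "Gemma 3 27B": 20.8},
    "LiveCodeBench v6": {"31B": 80.0, "26B A4B": 77.1, "Gemma 3 27B": 29.1},
    "MMMU Pro":         {"31B": 76.9, "26B A4B": 73.8, "Gemma 3 27B": 49.7},
}


def gap(benchmark: str, model_a: str, model_b: str) -> float:
    """Return model_a's score minus model_b's, in percentage points."""
    row = SCORES[benchmark]
    return round(row[model_a] - row[model_b], 1)


for name in SCORES:
    gain = gap(name, "26B A4B", "Gemma 3 27B")
    dense_margin = gap(name, "31B", "26B A4B")
    print(f"{name}: +{gain} over Gemma 3 27B, 31B ahead by {dense_margin}")
```

The generational gains range from 15.0 points (MMLU Pro) to 67.5 points (AIME 2026), while the dense model's lead over the MoE variant stays between 0.9 and 3.1 points.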

Getting Started on DeepInfra

Gemma 4 26B A4B is available on DeepInfra via an OpenAI-compatible API — no infrastructure setup required. Swap in your DeepInfra token and the model identifier, and your existing OpenAI client code works as-is.

Base URL: https://api.deepinfra.com/v1/openai
Model identifier: google/gemma-4-26B-A4B-it
Authentication: Authorization: Bearer $DEEPINFRA_TOKEN
Supports: JSON mode, function calling, streaming, multimodal input (text + image)

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "google/gemma-4-26B-A4B-it",
      "messages": [
        {"role": "user", "content": "Explain the MoE architecture in plain terms."}
      ]
    }'

import os

from openai import OpenAI

# Read the token from the environment; Python does not expand a literal
# "$DEEPINFRA_TOKEN" string the way a shell does.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "Explain the MoE architecture in plain terms."}],
)
print(response.choices[0].message.content)

import OpenAI from "openai";

// Read the token from the environment; a literal "$DEEPINFRA_TOKEN" string
// is not expanded by Node.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});

const response = await openai.chat.completions.create({
  model: "google/gemma-4-26B-A4B-it",
  messages: [{ role: "user", content: "Explain the MoE architecture in plain terms." }],
});
console.log(response.choices[0].message.content);

To enable thinking mode, include the <|think|> token in your system prompt. For multimodal inputs, place image content before text in the message payload. Recommended sampling defaults: temperature=1.0, top_p=0.95, top_k=64.
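
The image-first ordering and the recommended sampling values can be packaged into one request, sketched below. Several details are assumptions rather than documented behavior: that the endpoint accepts OpenAI-style `image_url` content parts, and that `top_k`, which is not a standard OpenAI chat parameter, is honored when forwarded through the `extra_body` argument of the openai Python client. The helper name and the example URL are hypothetical.

```python
from typing import Any, Dict


def build_vision_request(image_url: str, question: str,
                         model: str = "google/gemma-4-26B-A4B-it") -> Dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # Image part first, then the text part, as recommended above.
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        "temperature": 1.0,
        "top_p": 0.95,
        # Assumption: the server reads top_k from the request body.
        "extra_body": {"top_k": 64},
    }


request = build_vision_request("https://example.com/chart.png",
                               "What trend does this chart show?")
print([part["type"] for part in request["messages"][0]["content"]])
```

With a configured client, the request would be sent as `client.chat.completions.create(**request)`.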

The full API reference is available at deepinfra.com/google/gemma-4-26B-A4B-it/api. You can also try the model directly in the browser at the interactive demo.

Key Parameters

Parameter	Description
max_new_tokens	Limits the length of the generated output.
temperature	Controls response randomness (recommended: 1.0 for thinking mode).
stream	If true, sends partial deltas as server-sent events.
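
When `stream` is true, the response arrives as server-sent events. The sketch below shows how the text deltas could be reassembled by hand, assuming the OpenAI-style chunk layout (`data: {...}` lines ending with `data: [DONE]`); the sample lines are made up for illustration, and in practice the openai client libraries do this parsing for you.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample_lines)))  # prints "Hello"
```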

Pricing

Gemma 4 on DeepInfra uses usage-based pricing calculated per 1 million tokens:

Token Type	Price per 1M Tokens
Input Tokens	$0.07
Output Tokens	$0.34

For current rates, tier-based discounts, and model comparisons, visit the DeepInfra pricing page.
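
For budgeting, the per-million rates in the table translate into a simple cost estimate. The sketch uses the prices listed above as constants; the function name and the token counts in the example are illustrative, and live rates may differ.

```python
# Prices from the table above, in USD per 1 million tokens.
INPUT_PRICE_PER_M = 0.07
OUTPUT_PRICE_PER_M = 0.34


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for a given token volume."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Example workload: 2M input tokens and 500K output tokens.
cost = estimate_cost(2_000_000, 500_000)
print(f"${cost:.2f}")  # prints $0.31
```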

Conclusion

Gemma 4 delivers a meaningful generational step in reasoning, multimodal capability, and context length — all under Apache 2.0 licensing with no restrictions on commercial use or modification. The 26B A4B MoE variant is the most practical choice for most production workloads: near-flagship benchmark performance at inference speeds closer to a 4B dense model, at $0.07/1M input and $0.34/1M output tokens on DeepInfra.

To get started, visit deepinfra.com/google/gemma-4-26B-A4B-it to try the demo or access the API. For teams evaluating the broader model landscape, the full model catalog and the Gemma 4 pricing guide cover the full picture.
