Introducing GLM-5.2 on DeepInfra

Published on 2026.07.01 by DeepInfra

GLM-5.2 is Z-AI’s latest flagship model, built around one core capability: a stable, 1,048,576-token context window designed for long-horizon tasks. Most million-token context claims come with practical asterisks — degraded retrieval, inconsistent behavior at range. Z-AI describes this as the first time that scale has been delivered with reliability for sustained, long-horizon work. The coding improvements over its predecessor make the same point in numbers: DeepSWE goes from 18 to 46.2 between GLM-5.1 and GLM-5.2.

The architecture behind that context window is new. GLM-5.2 introduces IndexShare, which reuses the same indexer across every four sparse attention layers, cutting per-token FLOPs by 2.9× at 1M context length — meaning the longer window doesn’t come with the proportional compute cost you’d expect. The model also ships under an MIT license with no regional restrictions, which puts it in a different category from most models competing at this benchmark tier. It’s now available on DeepInfra under zai-org/GLM-5.2.

What Makes This Model Different

GLM-5.2 is Z-AI’s follow-up to GLM-5.1, and the headline upgrade is a stable 1,048,576-token (1M) context window. Long context support isn’t new, but reliable performance at that scale for long-horizon tasks is harder to deliver than it sounds. The key enabler here is IndexShare, a new architectural design that reuses the same indexer across every four sparse attention layers, cutting per-token FLOPs by 2.9× at 1M context length — a meaningful reduction that makes running very long contexts practical rather than theoretical.
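Before sending a very large prompt, it can help to sanity-check that it plausibly fits in the window. The following is a minimal sketch, not an official utility: the 1,048,576-token limit comes from the text above, while the four-characters-per-token ratio, the default output reservation, and the helper names are assumptions made for illustration. A real check would use the model's own tokenizer.

```python
CONTEXT_WINDOW_TOKENS = 1_048_576  # context size stated for GLM-5.2

# ASSUMPTION: ~4 characters per token is a rough heuristic for English text,
# not the ratio of the model's actual tokenizer.
ASSUMED_CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count of a string (ceiling division)."""
    return -(-len(text) // ASSUMED_CHARS_PER_TOKEN)


def fits_in_context(documents: list[str], reserved_for_output: int = 8_192) -> bool:
    """Return True if the estimated prompt size still leaves room for the reply.

    reserved_for_output is a hypothetical budget for generated tokens.
    """
    prompt_tokens = sum(estimate_tokens(doc) for doc in documents)
    return prompt_tokens + reserved_for_output <= CONTEXT_WINDOW_TOKENS
```

Because the estimate is coarse, treat a result close to the limit as a signal to measure with a real tokenizer rather than as a guarantee.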

On the inference side, GLM-5.2 ships with an improved multi-token prediction (MTP) layer that increases speculative decoding acceptance length by up to 20% over GLM-5.1. Combined with flexible thinking effort levels for coding tasks — letting you trade latency for quality depending on the task — the model gives developers real levers to tune behavior. For a detailed breakdown of how GLM-5.1 approached agentic engineering before this release, the GLM-5.1 model overview covers the architecture and design decisions that carried forward.

Every benchmark listed below improves over GLM-5.1, with gains ranging from 3.7 to 43.9 points:

Benchmark	GLM-5.1	GLM-5.2	Δ
HLE	31.0	40.5	+9.5
GPQA-Diamond	86.2	91.2	+5.0
AIME 2026	95.3	99.2	+3.9
SWE-bench Pro	58.4	62.1	+3.7
DeepSWE	18.0	46.2	+28.2
FrontierSWE (Dominance)	30.5	74.4	+43.9
Terminal Bench 2.1	63.5	81.0	+17.5
MCP-Atlas (Public)	71.8	76.8	+5.0

The coding gains are where the delta is hardest to ignore. DeepSWE jumps from 18 to 46.2, and FrontierSWE Dominance goes from 30.5 to 74.4 — both suggesting a meaningful shift in how the model handles real-world software engineering tasks, not just benchmark tuning. GLM-5.2 is competitive with DeepSeek-V4-Pro and Qwen3.7-Max across most categories, though it trails Claude Opus 4.8 on SWE-bench Pro (62.1 vs. 69.2).
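The Δ column can be recomputed directly from the two score columns. The sketch below copies the scores from the table; the function names are illustrative and not part of any published tooling.

```python
# Scores copied from the benchmark table: (GLM-5.1, GLM-5.2).
SCORES = {
    "HLE": (31.0, 40.5),
    "GPQA-Diamond": (86.2, 91.2),
    "AIME 2026": (95.3, 99.2),
    "SWE-bench Pro": (58.4, 62.1),
    "DeepSWE": (18.0, 46.2),
    "FrontierSWE (Dominance)": (30.5, 74.4),
    "Terminal Bench 2.1": (63.5, 81.0),
    "MCP-Atlas (Public)": (71.8, 76.8),
}


def deltas(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return the point gain per benchmark, rounded to one decimal place."""
    return {name: round(new - old, 1) for name, (old, new) in scores.items()}


def largest_gain(scores: dict[str, tuple[float, float]]) -> str:
    """Return the name of the benchmark with the biggest point gain."""
    gains = deltas(scores)
    return max(gains, key=gains.get)
```

Calling `largest_gain(SCORES)` picks out FrontierSWE (Dominance), with DeepSWE second, which matches the observation that the coding benchmarks moved the most.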

On capabilities, GLM-5.2 supports function calling and structured JSON output, making it straightforward to drop into agentic pipelines. It handles English and Chinese natively and is available under an MIT license with no regional restrictions. If you want to compare against other available options, the full models catalog has context length, pricing, and capability details across providers.

Getting Started on DeepInfra

GLM-5.2 is available on DeepInfra under the identifier zai-org/GLM-5.2. Pricing is usage-based: Standard Tier costs $0.95 per 1M input tokens and $3.00 per 1M output tokens, with cached input at $0.18 per 1M tokens. If you need guaranteed throughput, Priority Tier is available at 1.5× those rates ($1.425 input / $4.50 output / $0.27 cached input per 1M tokens). Private endpoint deployment is also supported for dedicated infrastructure. For a closer look at how GLM-5.1 pricing stacked up across providers before this release, the GLM-5.1 pricing guide gives useful context on where DeepInfra sits in the market.
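To turn those rates into a per-request figure, a small estimator is enough. This is a sketch under stated assumptions: the rates and the 1.5× multiplier come from the paragraph above, while the function name and the choice to count cached input tokens separately from regular input tokens are illustrative. Actual billing may account for tokens differently.

```python
# USD per 1M tokens on the Standard Tier, as quoted in the text.
STANDARD_RATES = {"input": 0.95, "cached_input": 0.18, "output": 3.00}
PRIORITY_MULTIPLIER = 1.5  # Priority Tier is 1.5x the Standard rates


def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
    priority: bool = False,
) -> float:
    """Estimate the cost of one request in USD.

    ASSUMPTION: cached_input_tokens are counted separately from
    input_tokens (they are not a subset of them).
    """
    cost = (
        input_tokens * STANDARD_RATES["input"]
        + cached_input_tokens * STANDARD_RATES["cached_input"]
        + output_tokens * STANDARD_RATES["output"]
    ) / 1_000_000
    if priority:
        cost *= PRIORITY_MULTIPLIER
    return round(cost, 6)
```

For example, `estimate_cost_usd(800_000, 4_000)` gives a rough price for a single long-context request on the Standard Tier, and passing `priority=True` scales the same request to Priority Tier rates.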

Access is through a fully OpenAI-compatible API, with no infrastructure to manage and no containers to spin up. DeepInfra operates with a zero data-retention policy and is SOC 2 and ISO 27001 certified. If you want to understand the latency and throughput characteristics before committing, the GLM-5.1 API benchmarks offer a reasonable proxy until GLM-5.2-specific numbers are published.

Here’s everything you need to make your first call:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "zai-org/GLM-5.2",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'

import os

from openai import OpenAI

# Read the API token from the DEEPINFRA_TOKEN environment variable.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="zai-org/GLM-5.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

import OpenAI from "openai";

// Read the API token from the DEEPINFRA_TOKEN environment variable.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});

const response = await openai.chat.completions.create({
  model: "zai-org/GLM-5.2",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);

The only things that differ from a standard OpenAI call are the base URL (https://api.deepinfra.com/v1/openai), your DeepInfra token, and the model name. The official OpenAI Python and Node.js SDKs work without modification. GLM-5.2 also supports JSON output mode and function calling out of the box, so tool-use workflows slot in without any extra wiring. You can explore the full GLM-5.2 API reference for parameter details, supported endpoints, and response schemas.
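As a rough illustration of a tool-use request, the sketch below builds the keyword arguments for a function-calling chat completion using the standard OpenAI `tools` schema. The `get_weather` tool, its parameters, and both helper names are hypothetical and exist only for this example. It is assumed, not confirmed here, that DeepInfra accepts the same fields as the OpenAI Chat Completions API for this model.

```python
import json


def build_tool_request(question: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": "zai-org/GLM-5.2",
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {
                "type": "function",
                "function": {
                    # Hypothetical tool, defined only for this example.
                    "name": "get_weather",
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }


def parse_tool_arguments(raw_arguments: str) -> dict:
    """Decode tool-call arguments, which the OpenAI SDK returns as a JSON string."""
    return json.loads(raw_arguments)
```

With a client configured as in the Python example above, `client.chat.completions.create(**build_tool_request("What's the weather in Paris?"))` would send the request. In the OpenAI SDK, any tool calls appear under `response.choices[0].message.tool_calls`, and each call's `function.arguments` string can be passed to `parse_tool_arguments`. For JSON output mode, the standard OpenAI parameter is `response_format={"type": "json_object"}`.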

For voice-enabled use cases, GLM-5.2 voice is also available on DeepInfra — worth noting if your pipeline involves audio I/O alongside the text and tool-use workflows.

Conclusion

GLM-5.2 is worth evaluating on a few concrete grounds: a million-token context window that holds up under load, coding benchmark gains that look more like a capability jump than incremental tuning, and an MIT license that removes the friction you’d normally expect at this tier. For developers building document-heavy pipelines, long-running agents, or multi-step coding workflows, those properties are practically useful rather than just impressive on paper.

If you’ve been waiting for a high-context model you can deploy freely and wire into agentic tooling without fighting the license, this is a reasonable place to start. Head to the GLM-5.2 demo to run a few calls and see how it handles your workload.
