Introducing GLM-5.2 on DeepInfra
Published on 2026.07.01 by DeepInfra

GLM-5.2 is Z-AI’s latest flagship model, built around one core capability: a stable, 1,048,576-token context window designed for long-horizon tasks. Most million-token context claims come with practical asterisks — degraded retrieval, inconsistent behavior at range. Z-AI describes this as the first time that scale has been delivered with reliability for sustained, long-horizon work. The coding improvements over its predecessor make the same point in numbers: DeepSWE goes from 18 to 46.2 between GLM-5.1 and GLM-5.2.

The architecture behind that context window is new. GLM-5.2 introduces IndexShare, which reuses the same indexer across every four sparse attention layers, cutting per-token FLOPs by 2.9× at 1M context length — meaning the longer window doesn’t come with the proportional compute cost you’d expect. The model also ships under an MIT license with no regional restrictions, which puts it in a different category from most models competing at this benchmark tier. It’s now available on DeepInfra under zai-org/GLM-5.2.

What Makes This Model Different

GLM-5.2 is Z-AI’s follow-up to GLM-5.1, and the headline upgrade is a stable 1,048,576-token (1M) context window. Long context support isn’t new, but reliable performance at that scale for long-horizon tasks is harder to deliver than it sounds. The key enabler is IndexShare, which reuses one indexer across every four sparse attention layers. The resulting 2.9× cut in per-token FLOPs at 1M context length is what makes very long contexts practical rather than theoretical.
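The post doesn’t spell out IndexShare’s internals, so the following is only a toy sketch of the sharing idea, not Z-AI’s implementation. It assumes the indexer picks a top-k subset of past tokens for sparse attention, and that "reusing" it means one indexer run serves a group of consecutive layers. The function names, the dot-product scoring, and the sizes are all made up for illustration.

```python
import random


def indexer_top_k(query, keys, k):
    """Toy indexer: score every key against the query, keep the k best positions."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_per_layer(query, keys, num_layers, k, share_every):
    """Return (selected positions per layer, number of indexer runs).

    share_every=1 runs the indexer on every layer; share_every=4 lets one
    run serve four consecutive layers. Simplification: a real model has
    different activations per layer, here the same query is reused.
    """
    selections, indexer_runs, current = [], 0, None
    for layer in range(num_layers):
        if layer % share_every == 0:
            current = indexer_top_k(query, keys, k)
            indexer_runs += 1
        selections.append(current)
    return selections, indexer_runs


rng = random.Random(0)
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(64)]
query = [rng.uniform(-1, 1) for _ in range(8)]

_, runs_unshared = select_per_layer(query, keys, num_layers=8, k=4, share_every=1)
shared, runs_shared = select_per_layer(query, keys, num_layers=8, k=4, share_every=4)
print(runs_unshared, runs_shared)  # 8 indexer runs vs. 2
```

The sketch only counts indexer runs. It does not reproduce the 2.9× figure, since attention itself still runs on every layer and the real cost split between indexer and attention isn’t given here.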

On the inference side, GLM-5.2 ships with an improved multi-token prediction (MTP) layer that increases speculative decoding acceptance length by up to 20% over GLM-5.1. It also offers flexible thinking effort levels for coding tasks, so you can trade latency for quality depending on the task. For a detailed breakdown of how GLM-5.1 approached agentic engineering before this release, the GLM-5.1 model overview covers the architecture and design decisions that carried forward.

Benchmark improvements over GLM-5.1 are substantial across the board:

Benchmark               | GLM-5.1 | GLM-5.2 | Δ
HLE                     | 31.0    | 40.5    | +9.5
GPQA-Diamond            | 86.2    | 91.2    | +5.0
AIME 2026               | 95.3    | 99.2    | +3.9
SWE-bench Pro           | 58.4    | 62.1    | +3.7
DeepSWE                 | 18.0    | 46.2    | +28.2
FrontierSWE (Dominance) | 30.5    | 74.4    | +43.9
Terminal Bench 2.1      | 63.5    | 81.0    | +17.5
MCP-Atlas (Public)      | 71.8    | 76.8    | +5.0

The coding gains are where the delta is hardest to ignore. DeepSWE jumps from 18 to 46.2, and FrontierSWE Dominance goes from 30.5 to 74.4 — both suggesting a meaningful shift in how the model handles real-world software engineering tasks, not just benchmark tuning. GLM-5.2 is competitive with DeepSeek-V4-Pro and Qwen3.7-Max across most categories, though it trails Claude Opus 4.8 on SWE-bench Pro (62.1 vs. 69.2).

On capabilities, GLM-5.2 supports function calling and structured JSON output, making it straightforward to drop into agentic pipelines. It handles English and Chinese natively and is available under an MIT license with no regional restrictions. If you want to compare against other available options, the full models catalog has context length, pricing, and capability details across providers.
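As a rough illustration of structured output, here is a minimal sketch that builds a JSON-mode request and validates the reply. It assumes the endpoint accepts the OpenAI-style `response_format={"type": "json_object"}` parameter, which you should confirm in the API reference. The helper names, the `answer`/`confidence` keys, and the sample reply are hypothetical.

```python
import json


def build_json_request(question: str) -> dict:
    """Build keyword arguments for client.chat.completions.create().

    Assumption: JSON mode is requested with the OpenAI-style
    response_format parameter; verify this in the API reference.
    """
    return {
        "model": "zai-org/GLM-5.2",
        "messages": [
            {
                "role": "system",
                "content": "Reply with a single JSON object with keys "
                           "'answer' and 'confidence'.",
            },
            {"role": "user", "content": question},
        ],
        "response_format": {"type": "json_object"},
    }


def parse_json_reply(content: str, required_keys=("answer", "confidence")) -> dict:
    """Parse the reply text and check that the expected keys are present."""
    data = json.loads(content)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data


# Offline check with a hand-written reply (not real model output).
sample_reply = '{"answer": "Paris", "confidence": 0.98}'
print(parse_json_reply(sample_reply)["answer"])
```

With an OpenAI SDK client pointed at DeepInfra, the request would be sent as `client.chat.completions.create(**build_json_request("..."))` and the reply text passed to `parse_json_reply`.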

Getting Started on DeepInfra

GLM-5.2 is available on DeepInfra under the identifier zai-org/GLM-5.2. Pricing is usage-based: Standard Tier runs $0.95 per 1M input tokens and $3.00 per 1M output tokens, with cached input at $0.18 per 1M tokens. If you need guaranteed throughput, Priority Tier is available at 1.5× those rates ($1.425 / $4.50 / $0.27). Private endpoint deployment is also supported for dedicated infrastructure. For a closer look at how GLM-5.1 pricing stacked up across providers before this release, the GLM-5.1 pricing guide gives useful context on where DeepInfra sits in the market.
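To make the pricing concrete, here is a small cost estimator built from the rates quoted above. It is a sketch: the function name and the example token counts are made up, and it assumes cached input tokens are billed at the cached rate instead of the regular input rate.

```python
from decimal import Decimal

# USD per 1M tokens, Standard Tier, as quoted in this post.
STANDARD_RATES = {
    "input": Decimal("0.95"),
    "cached_input": Decimal("0.18"),
    "output": Decimal("3.00"),
}
PRIORITY_MULTIPLIER = Decimal("1.5")  # Priority Tier is 1.5x Standard
TOKENS_PER_UNIT = Decimal(1_000_000)


def estimate_cost(input_tokens, output_tokens, cached_input_tokens=0, priority=False):
    """Return the estimated USD cost of one request as a Decimal.

    Assumption: cached tokens are billed at the cached rate in place of
    the regular input rate. Check your invoice before relying on this.
    """
    multiplier = PRIORITY_MULTIPLIER if priority else Decimal(1)
    total = (
        Decimal(input_tokens) * STANDARD_RATES["input"]
        + Decimal(cached_input_tokens) * STANDARD_RATES["cached_input"]
        + Decimal(output_tokens) * STANDARD_RATES["output"]
    )
    return total * multiplier / TOKENS_PER_UNIT


# Hypothetical long-context request: 800k fresh input tokens,
# 200k cached input tokens, 4k output tokens.
standard = estimate_cost(800_000, 4_000, cached_input_tokens=200_000)
priority = estimate_cost(800_000, 4_000, cached_input_tokens=200_000, priority=True)
print(f"standard: ${standard:.4f}")  # standard: $0.8080
print(f"priority: ${priority:.4f}")  # priority: $1.2120
```

`Decimal` is used instead of floats so that small per-request amounts add up without rounding drift when you sum many calls.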

Access is through a fully OpenAI-compatible API, so there is no infrastructure to manage and no containers to spin up. DeepInfra operates with a zero data-retention policy and is SOC 2 and ISO 27001 certified. If you want to understand the latency and throughput characteristics before committing, the GLM-5.1 API benchmarks offer a reasonable proxy until GLM-5.2-specific numbers are published.

Here’s everything you need to make your first call:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "zai-org/GLM-5.2",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
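    }'

The same request with the official OpenAI Python SDK:

import os

from openai import OpenAI

# Read the token from the environment instead of hard-coding it.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="zai-org/GLM-5.2",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

And with the official OpenAI Node.js SDK:
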
import OpenAI from "openai";

// Read the token from the environment instead of hard-coding it.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});

const response = await openai.chat.completions.create({
  model: "zai-org/GLM-5.2",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);

The only things that differ from a standard OpenAI call are the base URL (https://api.deepinfra.com/v1/openai), your DeepInfra token, and the model name. The official OpenAI Python and Node.js SDKs work without modification. GLM-5.2 also supports JSON output mode and function calling out of the box, so tool-use workflows slot in without any extra wiring. You can explore the full GLM-5.2 API reference for parameter details, supported endpoints, and response schemas.
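For tool use, a minimal sketch of the client-side half of a function-calling loop might look like the following. The tool schema uses the standard OpenAI `tools` format. The `get_weather` tool, the registry, and `run_tool_call` are hypothetical names for illustration and not part of any DeepInfra or Z-AI API. The example runs offline against a hand-written tool call.

```python
import json


# Hypothetical tool for illustration; returns canned data.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "sunny"}


# Tool description in the OpenAI "tools" format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the weather forecast for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments_json: str) -> dict:
    """Execute one requested tool call and wrap the result as a 'tool' message."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    kwargs = json.loads(arguments_json)  # arguments arrive as a JSON string
    result = REGISTRY[name](**kwargs)
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


# Offline check with a hand-written tool call (not real model output).
message = run_tool_call("call_0", "get_weather", '{"city": "Sofia"}')
print(message["content"])
```

In a live loop you would pass `tools=TOOLS` to `client.chat.completions.create(...)`, call `run_tool_call(call.id, call.function.name, call.function.arguments)` for each entry in `response.choices[0].message.tool_calls`, append the assistant message and the resulting tool messages to `messages`, and send the conversation back for the final answer.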

For voice-enabled use cases, GLM-5.2 voice is also available on DeepInfra — worth noting if your pipeline involves audio I/O alongside the text and tool-use workflows.

Conclusion

GLM-5.2 is worth evaluating on a few concrete grounds: a million-token context window that holds up under load, coding benchmark gains that look more like a capability jump than incremental tuning, and an MIT license that removes the friction you’d normally expect at this tier. For developers building document-heavy pipelines, long-running agents, or multi-step coding workflows, those properties are practically useful rather than just impressive on paper.

If you’ve been waiting for a high-context model you can deploy freely and wire into agentic tooling without fighting the license, this is a reasonable place to start. Head to the GLM-5.2 demo to run a few calls and see how it handles your workload.
