GLM-5.1 on DeepInfra: Z.AI’s Agentic Engineering Model

Published on 2026.05.25 by DeepInfra

Z.AI’s GLM-5.1 scores 58.4 on SWE-Bench Pro — ahead of both Claude Opus 4.6 (57.3) and GPT-5.4 (57.7) on real-world software engineering tasks. It’s the direct successor to GLM-5, designed for agentic engineering: long-horizon coding tasks, terminal operations, and repository-level work. The core design premise is that previous models, including GLM-5, tend to plateau after their initial gains — GLM-5.1 is built to keep improving across hundreds of rounds and thousands of tool calls.

What makes that architectural choice meaningful in practice is the model’s capacity for iterative strategy revision: breaking down ambiguous problems, running experiments, reading results, and identifying blockers rather than burning through a fixed repertoire early. It carries a 202,752-token context window, supports function calling and JSON natively, and ships under an MIT license — a meaningful detail for teams thinking about deployment flexibility. At $1.05 per million input tokens and $3.50 per million output tokens, it sits at a competitive price point relative to the frontier models it benchmarks against. It’s now available on DeepInfra.

What Makes This Model Different

GLM-5.1 is Z.AI’s successor to GLM-5, built around a specific thesis: most models hit a performance ceiling on long-running agentic tasks and then stall. GLM-5.1 is explicitly designed to keep improving as it’s given more time — sustaining performance across hundreds of rounds and thousands of tool calls rather than exhausting its strategy early.

The clearest evidence shows up in coding and terminal benchmarks, where GLM-5.1 pulls ahead of its predecessor by meaningful margins:

Benchmark	GLM-5.1	GLM-5	Notable Comparisons
SWE-Bench Pro	58.4	55.1	Claude Opus 4.6: 57.3, GPT-5.4: 57.7
NL2Repo	42.7	35.9	Claude Opus 4.6: 49.8, GPT-5.4: 41.3
Terminal-Bench 2.0	63.5	56.2	Claude Opus 4.6: 65.4
CyberGym	68.7	48.3	Claude Opus 4.6: 66.6

On SWE-Bench Pro, GLM-5.1 lands ahead of both Claude Opus 4.6 and GPT-5.4. On NL2Repo it edges past GPT-5.4 (42.7 vs. 41.3) but trails Claude Opus 4.6 (49.8), and on Terminal-Bench 2.0 it sits just behind Claude Opus 4.6 (63.5 vs. 65.4). CyberGym sees the most dramatic jump: from 48.3 to 68.7, beating Claude Opus 4.6’s 66.6. GLM-5.1 is also available on NVIDIA’s build platform, which gives you another access path if you’re already working within that ecosystem.

On general reasoning, the gains are more modest. GPQA-Diamond moves from 86.0 to 86.2, math benchmarks are roughly flat or slightly down (HMMT Nov: 96.9 → 94.0), and HLE with tools goes from 50.4 to 52.3. The model is tuned for agentic work, not pure reasoning competitions. GLM-5.1 also scores 79.3 on BrowseComp with context management enabled, ahead of DeepSeek-V3.2 (51.4) and competitive with other top-tier models.

The model supports a 202,752-token context window with JSON and function calling — both required for real tool-use pipelines. It handles English and Chinese, is MIT-licensed, and is served in fp4 quantization on DeepInfra under zai-org/GLM-5.1. If you want to understand the broader GLM model lineage, the GLM-4.5 blog post covers the foundation model that preceded this generation.
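To make the function-calling support concrete, here is a minimal sketch of a tool-use loop in Python. It assumes the OpenAI-style `tools` request format and `tool_calls` response shape, and that `client` is an OpenAI SDK client pointed at https://api.deepinfra.com/v1/openai. The `count_lines` tool, the `execute_tool_call` helper, and the `run_agent` loop are hypothetical names made up for illustration; they are not part of DeepInfra or GLM-5.1.

```python
import json


# Hypothetical local tool, for illustration only.
def count_lines(text: str) -> int:
    """Count the lines in a piece of text."""
    return len(text.splitlines())


# Registry mapping tool names to local Python callables.
LOCAL_TOOLS = {"count_lines": count_lines}

# Tool schema in the OpenAI-style "tools" format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "count_lines",
            "description": "Count the number of lines in a piece of text.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    }
]


def execute_tool_call(name: str, arguments_json: str) -> str:
    """Run one requested tool locally and return the result as a JSON string."""
    func = LOCAL_TOOLS.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        return json.dumps({"result": func(**args)})
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong arguments: report back instead of crashing.
        return json.dumps({"error": str(exc)})


def run_agent(client, prompt: str, max_rounds: int = 5) -> str:
    """Call the model until it answers without requesting a tool.

    Assumes `client` is an openai.OpenAI instance configured for DeepInfra.
    """
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_rounds):
        response = client.chat.completions.create(
            model="zai-org/GLM-5.1", messages=messages, tools=TOOLS
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content
        # Keep the assistant turn, then append one tool result per call.
        messages.append(message)
        for call in message.tool_calls:
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": execute_tool_call(
                        call.function.name, call.function.arguments
                    ),
                }
            )
    return "stopped: max_rounds reached"
```

The `max_rounds` cap is a design choice for the sketch. A long-horizon agent would likely raise it and add its own context management, but a hard stop keeps a runaway loop from consuming tokens indefinitely.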

Getting Started on DeepInfra

GLM-5.1 is available now on DeepInfra under the identifier zai-org/GLM-5.1 as a public endpoint. Pricing is usage-based: $1.05 per 1M input tokens, $3.50 per 1M output tokens, and $0.205 per 1M cached tokens. Private endpoint deployment is also supported if you need dedicated capacity — configure that directly from the DeepInfra dashboard.
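For a rough sense of what a workload costs at these rates, the arithmetic can be sketched as below. The prices come from the paragraph above; the `estimate_cost_usd` helper is hypothetical, and it assumes cached tokens are counted separately from regular input tokens and billed only at the cached rate, which may not match how your invoice is computed.

```python
# USD per 1M tokens for zai-org/GLM-5.1, as quoted in this post.
PRICE_PER_MILLION_USD = {"input": 1.05, "output": 3.50, "cached": 0.205}


def estimate_cost_usd(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate request cost in USD.

    Assumption: `input_tokens` excludes cached tokens, which are billed
    separately at the cached rate.
    """
    total = (
        input_tokens * PRICE_PER_MILLION_USD["input"]
        + output_tokens * PRICE_PER_MILLION_USD["output"]
        + cached_tokens * PRICE_PER_MILLION_USD["cached"]
    )
    return total / 1_000_000
```

Under these assumptions, a single request with 200,000 uncached input tokens and 4,000 output tokens comes to about $0.224.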

DeepInfra gives you access to GLM-5.1 through an OpenAI-compatible API with zero infrastructure setup. DeepInfra operates with a zero-retention policy and is SOC 2 and ISO 27001 certified. If you’re planning to use GLM-5.1 for production coding workflows — Claude Code, Kilo Code, Cline, or similar tools — the GLM Coding Plan is worth reviewing for team-level access options.

To make your first call, grab your API key from the Dashboard and swap in the model identifier:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "zai-org/GLM-5.1",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'

import os

from openai import OpenAI

# Read the API key from the environment rather than hard-coding it.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="zai-org/GLM-5.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

import OpenAI from "openai";

// Read the API key from the environment rather than hard-coding it.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});

const response = await openai.chat.completions.create({
  model: "zai-org/GLM-5.1",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);

The only things that change from a standard OpenAI call are the base URL (https://api.deepinfra.com/v1/openai), your DeepInfra token, and the model name — the official OpenAI Python and Node.js SDKs work without any modifications. Head to deepinfra.com/zai-org/GLM-5.1 to start building.

Conclusion

GLM-5.1 makes a credible case for itself in the scenarios where agentic models tend to break down: long-running tasks, messy repositories, and multi-step terminal workflows that demand sustained reasoning rather than a single flash of capability. It does not lead every benchmark (Claude Opus 4.6 is still ahead on NL2Repo and Terminal-Bench 2.0), but its results on SWE-Bench Pro and CyberGym reflect a model that was deliberately tuned for the kind of work developers actually need to automate.

That opens up real engineering applications: autonomous PR triage pipelines, self-directed debugging agents, or repo-scale refactoring tools that don’t fall apart midway through. If any of that maps to what you’re building, GLM-5.1 is worth running through your eval pipeline. It’s also worth keeping in mind that “agentic model” here means something specific: not just a model with tool access, but one designed for the iterative, multi-step problem solving that real engineering tasks demand. Head to deepinfra.com/zai-org/GLM-5.1 to get started.
