Z.AI’s GLM-5.1 scores 58.4 on SWE-Bench Pro — ahead of both Claude Opus 4.6 (57.3) and GPT-5.4 (57.7) on real-world software engineering tasks. It’s the direct successor to GLM-5, designed for agentic engineering: long-horizon coding tasks, terminal operations, and repository-level work. The core design premise is that previous models, including GLM-5, tend to plateau after their initial gains — GLM-5.1 is built to keep improving across hundreds of rounds and thousands of tool calls.
What makes that design premise meaningful in practice is the model’s capacity for iterative strategy revision: breaking down ambiguous problems, running experiments, reading results, and identifying blockers rather than burning through a fixed repertoire early. It carries a 202,752-token context window, supports function calling and JSON output natively, and ships under an MIT license, which matters for teams thinking about deployment flexibility. At $1.05 per million input tokens and $3.50 per million output tokens, it is priced competitively against the frontier models it benchmarks against. It’s now available on DeepInfra.
GLM-5.1 is Z.AI’s successor to GLM-5, built around a specific thesis: most models hit a performance ceiling on long-running agentic tasks and then stall. GLM-5.1 is explicitly designed to keep improving as it’s given more time — sustaining performance across hundreds of rounds and thousands of tool calls rather than exhausting its strategy early.
The clearest evidence shows up in coding and terminal benchmarks, where GLM-5.1 pulls ahead of its predecessor by meaningful margins:
| Benchmark | GLM-5.1 | GLM-5 | Notable Comparisons |
|---|---|---|---|
| SWE-Bench Pro | 58.4 | 55.1 | Claude Opus 4.6: 57.3, GPT-5.4: 57.7 |
| NL2Repo | 42.7 | 35.9 | Claude Opus 4.6: 49.8, GPT-5.4: 41.3 |
| Terminal-Bench 2.0 | 63.5 | 56.2 | Claude Opus 4.6: 65.4 |
| CyberGym | 68.7 | 48.3 | Claude Opus 4.6: 66.6 |
On SWE-Bench Pro, GLM-5.1 lands ahead of both Claude Opus 4.6 and GPT-5.4. On NL2Repo it edges out GPT-5.4 (42.7 vs. 41.3) but trails Claude Opus 4.6 (49.8), and on Terminal-Bench 2.0 it sits just behind Claude Opus 4.6 (63.5 vs. 65.4). CyberGym sees the most dramatic jump: from 48.3 to 68.7, beating Claude Opus 4.6’s 66.6. GLM-5.1 is also available on NVIDIA’s build platform, which gives you another access path if you’re already working within that ecosystem.
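The margins in the benchmark table can be recomputed directly from its scores. The sketch below is illustrative only: the `SCORES` dictionary and the `margins` helper are names introduced here, and the numbers are copied from the table rather than measured independently.

```python
# Scores copied from the benchmark table: (GLM-5.1, GLM-5, {comparison: score}).
SCORES = {
    "SWE-Bench Pro": (58.4, 55.1, {"Claude Opus 4.6": 57.3, "GPT-5.4": 57.7}),
    "NL2Repo": (42.7, 35.9, {"Claude Opus 4.6": 49.8, "GPT-5.4": 41.3}),
    "Terminal-Bench 2.0": (63.5, 56.2, {"Claude Opus 4.6": 65.4}),
    "CyberGym": (68.7, 48.3, {"Claude Opus 4.6": 66.6}),
}


def margins(scores):
    """Return GLM-5.1's point difference against GLM-5 and each listed model."""
    rows = {}
    for benchmark, (new, old, others) in scores.items():
        row = {"vs GLM-5": round(new - old, 1)}
        for model, score in others.items():
            # Positive means GLM-5.1 is ahead, negative means it trails.
            row[f"vs {model}"] = round(new - score, 1)
        rows[benchmark] = row
    return rows


if __name__ == "__main__":
    for benchmark, row in margins(SCORES).items():
        print(benchmark, row)
```

A positive value means GLM-5.1 leads on that benchmark and a negative value means it trails, which makes it easy to see where the gain over GLM-5 is largest and where another model still holds the lead.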
On general reasoning, the changes are smaller and mixed. GPQA-Diamond moves from 86.0 to 86.2, HLE with tools goes from 50.4 to 52.3, and math benchmarks are roughly flat or slightly down (HMMT Nov: 96.9 → 94.0). The model is tuned for agentic work, not pure reasoning competitions. On agentic browsing, GLM-5.1 scores 79.3 on BrowseComp with context management enabled, ahead of DeepSeek-V3.2 (51.4) and competitive with other top-tier models.
The model supports a 202,752-token context window with JSON and function calling — both required for real tool-use pipelines. It handles English and Chinese, is MIT-licensed, and is served in fp4 quantization on DeepInfra under zai-org/GLM-5.1. If you want to understand the broader GLM model lineage, the GLM-4.5 blog post covers the foundation model that preceded this generation.
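To make the function-calling support concrete, here is a minimal sketch of a tool-use request and of reading the model's tool calls back out. It assumes the DeepInfra endpoint accepts the standard OpenAI Chat Completions `tools` field and returns `tool_calls` in the usual OpenAI shape. The `run_tests` tool and the helper names are hypothetical and exist only for illustration.

```python
import json
import os
import urllib.request

API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL = "zai-org/GLM-5.1"

# Hypothetical tool, described in the OpenAI Chat Completions "tools" format.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the test suite for a path and return the failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_tool_request(prompt):
    """Build a chat-completions payload that offers the model one tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [RUN_TESTS_TOOL],
        "tool_choice": "auto",
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from an OpenAI-style response dict."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # In the OpenAI format, "arguments" arrives as a JSON-encoded string.
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


def send(payload):
    """POST the payload; requires DEEPINFRA_TOKEN and network access."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + os.environ["DEEPINFRA_TOKEN"],
        },
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

In an agent loop you would call `send(build_tool_request(...))`, run whatever `extract_tool_calls` returns, and append the results as `tool` messages before the next round.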
GLM-5.1 is available now on DeepInfra under the identifier zai-org/GLM-5.1 as a public endpoint. Pricing is usage-based: $1.05 per 1M input tokens, $3.50 per 1M output tokens, and $0.205 per 1M cached tokens. Private endpoint deployment is also supported if you need dedicated capacity — configure that directly from the DeepInfra dashboard.
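For budgeting long agentic runs, the listed rates translate into a simple estimate. The sketch below assumes cached tokens are billed at the cached rate in place of the regular input rate, and that the three counts passed in do not overlap. The `estimate_cost` helper is a name introduced here, not part of any DeepInfra SDK.

```python
from decimal import Decimal

# USD per 1M tokens, as listed for zai-org/GLM-5.1 on DeepInfra.
PRICE_INPUT = Decimal("1.05")
PRICE_OUTPUT = Decimal("3.50")
PRICE_CACHED = Decimal("0.205")
PER_MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens, output_tokens, cached_tokens=0):
    """Estimate the USD cost of a run.

    Assumption: input_tokens counts only uncached input, and cached_tokens
    are billed at the cached rate instead of the input rate.
    """
    return (
        PRICE_INPUT * input_tokens
        + PRICE_OUTPUT * output_tokens
        + PRICE_CACHED * cached_tokens
    ) / PER_MILLION


if __name__ == "__main__":
    # Example: 2M uncached input, 500K output, 10M cached tokens.
    print(estimate_cost(2_000_000, 500_000, 10_000_000))
```

Under these assumptions, a run with 2M uncached input tokens, 500K output tokens, and 10M cached tokens comes to $5.90, which shows how much of a long-running agent's bill depends on cache hits. `Decimal` is used so the totals are not distorted by floating-point rounding.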
DeepInfra gives you access to GLM-5.1 through an OpenAI-compatible API with zero infrastructure setup. DeepInfra operates with a zero-retention policy and is SOC 2 and ISO 27001 certified. If you’re planning to use GLM-5.1 for production coding workflows — Claude Code, Kilo Code, Cline, or similar tools — the GLM Coding Plan is worth reviewing for team-level access options.
To make your first call, grab your API key from the Dashboard and swap in the model identifier:
```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "zai-org/GLM-5.1",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```

```python
import os

from openai import OpenAI

# Read the token from the environment instead of hard-coding it.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="zai-org/GLM-5.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

```javascript
import OpenAI from "openai";

// Read the token from the environment instead of hard-coding it.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});

const response = await openai.chat.completions.create({
  model: "zai-org/GLM-5.1",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
```

The only things that change from a standard OpenAI call are the base URL (https://api.deepinfra.com/v1/openai), your DeepInfra token, and the model name. The official OpenAI Python and Node.js SDKs work without any modifications. Head to deepinfra.com/zai-org/GLM-5.1 to start building.
GLM-5.1 makes a credible case for itself in the scenarios where agentic models tend to break down: long-running tasks, messy repositories, and multi-step terminal workflows that demand sustained reasoning rather than a single flash of capability. It does not win every comparison (Claude Opus 4.6 still leads on NL2Repo and Terminal-Bench 2.0), but its results on SWE-Bench Pro and CyberGym reflect a model that was deliberately tuned for the kind of work developers actually need to automate.
That opens up real engineering applications: autonomous PR triage pipelines, self-directed debugging agents, or repo-scale refactoring tools that don’t fall apart midway through. If any of that maps to what you’re building, GLM-5.1 is worth running through your eval pipeline. It’s also worth keeping in mind that “agentic model” here means something specific: not just a model with tool access, but one designed for the iterative, multi-step problem solving that real engineering tasks actually demand. Head to deepinfra.com/zai-org/GLM-5.1 to get started.
© 2026 DeepInfra. All rights reserved.