
GLM-5.1 is Z.AI’s next-generation flagship model for agentic engineering, released on April 7, 2026 under the MIT license. It is a 754-billion-parameter Mixture-of-Experts model with 40 billion active parameters per token, a 202,752-token context window, and up to 131K output tokens. The model is the direct successor to GLM-5 and is designed specifically for long-horizon autonomous tasks: not just single-turn completions, but sustained, iterative workflows across hundreds of rounds and thousands of tool calls. Weights are available on Hugging Face, and the model is served on DeepInfra at deepinfra.com/zai-org/GLM-5.1.
The core design principle of GLM-5.1 is endurance: the ability to keep improving across extended autonomous runs rather than plateauing after initial gains. Previous models — including GLM-5 — tend to exhaust their strategy early and stall. GLM-5.1 is built to keep revising, running experiments, reading results, and identifying blockers across the full length of a task.
Z.AI demonstrated this with two concrete examples. In one, GLM-5.1 built a complete Linux desktop environment autonomously over an 8-hour session, running 655 iterations of planning, execution, testing, and optimization. In another, it increased vector database query throughput to 6.9x the initial production baseline through sustained iterative improvement. These are not single-pass results — they reflect a model designed to improve the longer it runs.
GLM-5.1 leads on SWE-Bench Pro (58.4), ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3), and posts the largest improvement over its predecessor on CyberGym (+20.4 points). On general reasoning benchmarks, it is competitive but not the leader — GPT-5.4 and Gemini 3.1 Pro lead on AIME 2026 and GPQA-Diamond. The model is clearly tuned for coding and agentic execution rather than pure mathematical reasoning.
Asterisked (*) competitor scores on HLE with tools were not available from official sources and were re-evaluated by Z.AI under the same conditions used for GLM-5.1.
| Benchmark | GLM-5.1 | GLM-5 | Qwen3.6-Plus | DeepSeek-V3.2 | Claude Opus 4.6 | GPT-5.4 |
|---|---|---|---|---|---|---|
| HLE (no tools) | 31.0 | 30.5 | 28.8 | 25.1 | 36.7 | 39.8 |
| HLE (w/ tools) | 52.3 | 50.4 | 50.6 | 40.8 | 53.1* | 52.1* |
| AIME 2026 | 95.3 | 95.4 | 95.1 | 95.1 | 98.2 | 98.7 |
| GPQA-Diamond | 86.2 | 86.0 | 90.4 | 82.4 | 94.3 | 92.0 |
| SWE-Bench Pro | 58.4 | 55.1 | 56.6 | — | 54.2 | 57.7 |
| NL2Repo | 42.7 | 35.9 | 37.9 | — | 49.8 | 41.3 |
| Terminal-Bench 2.0 | 63.5 | 56.2 | 61.6 | 39.3 | 68.5 | — |
| τ³-Bench | 70.6 | 69.2 | 70.7 | 69.2 | 67.1 | 72.9 |
Additional results not in the table: CyberGym 68.7 (up from GLM-5’s 48.3, ahead of Claude Opus 4.6’s 66.6); BrowseComp 68.0 standard / 79.3 with Context Management enabled; MCP-Atlas Public 71.8; Vending Bench 2 $5,634.
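As a quick cross-check of the generation-over-generation gains, the sketch below subtracts GLM-5's score from GLM-5.1's for each benchmark. The score pairs are copied from the table and the CyberGym figures quoted above; the `deltas` helper and the ranking are only illustrative and are not part of Z.AI's evaluation.

```python
# (GLM-5.1, GLM-5) score pairs copied from the table and text above.
SCORES = {
    "HLE (no tools)": (31.0, 30.5),
    "HLE (w/ tools)": (52.3, 50.4),
    "AIME 2026": (95.3, 95.4),
    "GPQA-Diamond": (86.2, 86.0),
    "SWE-Bench Pro": (58.4, 55.1),
    "NL2Repo": (42.7, 35.9),
    "Terminal-Bench 2.0": (63.5, 56.2),
    "τ³-Bench": (70.6, 69.2),
    "CyberGym": (68.7, 48.3),
}


def deltas(scores):
    """Return the point change from GLM-5 to GLM-5.1, rounded to one decimal."""
    return {name: round(new - old, 1) for name, (new, old) in scores.items()}


# Sort from largest gain to largest regression.
ranked = sorted(deltas(SCORES).items(), key=lambda item: item[1], reverse=True)

for name, delta in ranked:
    print(f"{name:<20} {delta:+.1f}")
```

Sorted this way, CyberGym comes first at +20.4, and AIME 2026 is the only benchmark in this set where GLM-5.1 scores slightly below GLM-5.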
GLM-5.1 is available on DeepInfra via an OpenAI-compatible API. No infrastructure setup is required: swap in the model identifier and your DeepInfra token.
Example request:
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-d '{
"model": "zai-org/GLM-5.1",
"messages": [{"role": "user", "content": "Hello!"}]
}'

The full API reference is available at deepinfra.com/zai-org/GLM-5.1/api.
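For readers who prefer Python, the same request can be sketched with only the standard library. The URL, model identifier, headers, and message body mirror the curl example; the helper names `build_chat_request` and `send` are hypothetical. The response shape is assumed to follow the OpenAI chat completions format, which this article does not spell out.

```python
import json
import urllib.request

API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL = "zai-org/GLM-5.1"


def build_chat_request(prompt: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl example."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

To make a live call, pass your token (for example, the value of the `DEEPINFRA_TOKEN` environment variable) to `build_chat_request` and hand the result to `send`. If the response follows the OpenAI format as assumed, the reply text would be at `choices[0].message.content`.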
GLM-5.1 on DeepInfra uses usage-based pricing calculated per 1 million tokens:
| Token Type | Price per 1M Tokens |
|---|---|
| Input Tokens | $1.05 |
| Output Tokens | $3.50 |
| Cached Input Tokens | $0.205 |
The cached input rate of $0.205/1M tokens is particularly relevant for agentic workloads that repeatedly send stable prefixes — system prompts, tool schemas, repo context, or persistent agent instructions. For a detailed breakdown of how token economics play out across providers and workload types, see the GLM-5.1 pricing guide.
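To see how much the cached rate matters, here is a minimal cost estimator built on the three prices in the table above. It assumes cached and uncached input tokens are counted separately and each billed at its own rate. The function name and the token counts in the usage note are illustrative, not taken from DeepInfra's billing documentation.

```python
from decimal import Decimal

# USD per 1M tokens, copied from the pricing table above.
PRICE_PER_MILLION = {
    "input": Decimal("1.05"),
    "cached_input": Decimal("0.205"),
    "output": Decimal("3.50"),
}

ONE_MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, cached_input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request.

    Assumption: input_tokens counts only uncached input, and
    cached_input_tokens is billed at the cached rate instead.
    """
    total = (
        Decimal(input_tokens) * PRICE_PER_MILLION["input"]
        + Decimal(cached_input_tokens) * PRICE_PER_MILLION["cached_input"]
        + Decimal(output_tokens) * PRICE_PER_MILLION["output"]
    )
    return total / ONE_MILLION


# Hypothetical agent step: 50,000-token stable prefix, 2,000 fresh tokens, 1,000 output tokens.
cold_call = estimate_cost(52_000, 0, 1_000)       # nothing cached yet
warm_call = estimate_cost(2_000, 50_000, 1_000)   # prefix served from cache
```

With these made-up token counts, the uncached call comes to $0.0581 and the cached call to $0.01585, so a loop that repeats the same prefix hundreds of times pays roughly a quarter of the uncached price per step.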
For current pricing, visit the DeepInfra pricing page. Private endpoint deployment is also available for teams that need dedicated capacity.
Due to the 754B parameter scale, self-hosting GLM-5.1 requires significant hardware — a minimum of 1x NVIDIA HGX B200 (8x B200 GPUs) at full precision. The FP8 quantized checkpoint (zai-org/GLM-5.1-FP8) is the recommended serving target, reducing memory requirements to approximately 860GB while preserving output quality for production workloads. Supported inference engines are vLLM (v0.19.0+) and SGLang (v0.5.10+). The MIT license covers commercial deployment and fine-tuning without restrictions.
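A back-of-envelope, weights-only estimate helps put these memory figures in context. The sketch assumes "full precision" means BF16 at 2 bytes per parameter and FP8 means 1 byte per parameter. It ignores KV cache, activations, and any tensors kept at higher precision, so it is a lower bound rather than a sizing guide.

```python
TOTAL_PARAMS = 754_000_000_000  # total parameter count stated above

# Assumed storage widths in bytes per parameter (not taken from the model card).
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_memory_gb(precision: str, total_params: int = TOTAL_PARAMS) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[precision] / 1e9


bf16_gb = weight_memory_gb("bf16")
fp8_gb = weight_memory_gb("fp8")

print(f"BF16 weights: {bf16_gb:.0f} GB")
print(f"FP8 weights:  {fp8_gb:.0f} GB")
```

Under these assumptions the FP8 weights alone come to 754 GB, so the roughly 860GB figure quoted above leaves about 106 GB for everything else. How that remainder splits between higher-precision tensors and runtime overhead is not stated here.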
GLM-5.1 is the strongest open-weight choice for developers building long-horizon autonomous coding agents. Its benchmark results on SWE-Bench Pro, CyberGym, and NL2Repo reflect a model that was deliberately tuned for the kind of iterative, multi-step engineering work that most coding models struggle to sustain. The trade-offs are real — GPT-5.4 and Gemini 3.1 Pro lead on pure reasoning benchmarks, and the context window (203K) is smaller than some proprietary alternatives — but for agentic coding workflows, the combination of open weights, MIT licensing, and sustained long-run performance makes a credible case.
Visit deepinfra.com/zai-org/GLM-5.1 to try the demo, review API documentation, or grab your API key and start building. For context on how GLM-5.1 compares against other models in the GLM family, the GLM-5 API benchmarks and GLM-4.6 vs DeepSeek-V3.2 comparison are useful reference points.