
GLM-5.2 is Z.ai’s flagship open-source large language model, engineered for long-horizon coding, agentic workflows, complex reasoning, and large-scale data processing. It introduces a 1,048,576-token context window alongside significant architectural changes.
Hosted on the DeepInfra platform, GLM-5.2 provides developers with a high-performance, OpenAI-compatible interface. Whether you are building agentic workflows, analyzing entire codebases, or processing lengthy documents, GLM-5.2 offers the stability and intelligence required for next-generation AI applications.
GLM-5.2 was released on June 13, 2026, succeeding GLM-5.1 in the GLM-5 family. Unlike previous iterations, this model is engineered to maintain output quality and stability even when the 1M-token context is fully utilized, allowing for the seamless processing of large datasets and complex, multi-file repositories in a single prompt.
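Before sending a whole repository in one prompt, it can help to check roughly whether it fits. The sketch below is only an estimate: the 4-characters-per-token ratio is an assumed heuristic, not the GLM tokenizer, and the `reserved_for_output` default is an arbitrary illustrative value.

```python
CONTEXT_WINDOW = 1_048_576  # tokens, the GLM-5.2 context size stated above
CHARS_PER_TOKEN = 4         # ASSUMPTION: rough heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(texts, reserved_for_output: int = 8_192) -> bool:
    """Return True if the estimated prompt size leaves room for the reply.

    `reserved_for_output` is a hypothetical budget for generated tokens.
    """
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserved_for_output <= CONTEXT_WINDOW
```

For an exact count, tokenize with the model's own tokenizer instead of relying on the character heuristic.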
IndexShare and MTP: To support this context window efficiently, Z.ai introduced IndexShare, a mechanism that shares a single indexer across each group of four sparse attention layers, resulting in a reported 2.9x reduction in per-token FLOPs at maximum context length. An upgraded Multi-Token Prediction (MTP) layer also improves speculative decoding, increasing token acceptance length by up to 20% for faster, more cost-effective generation.
Flexible Reasoning: GLM-5.2 features a “Flexible Effort” system (High and Max modes) that lets users adjust the model’s thinking depth to balance reasoning performance against latency. Z.ai recommends the Max effort level for complex, multi-step tasks.
Open Access: GLM-5.2 is released under the MIT license, allowing unrestricted commercial use, modification, and self-hosting.
GLM-5.2 demonstrates strong performance across industry-standard evaluations, frequently rivaling or approaching proprietary models such as GPT-5.5 and Claude Opus 4.8.
| Category | Benchmark | GLM-5.2 | GLM-5.1 | Qwen3.7-Max | GPT-5.5 | Claude Opus 4.8 |
|---|---|---|---|---|---|---|
| Reasoning | GPQA-Diamond | 91.2 | 86.2 | 90.0 | 93.6 | 93.6 |
| Math | AIME 2026 | 99.2 | 95.3 | 97.0 | 98.3 | 95.7 |
| | IMOAnswerBench | 91.0 | 83.8 | 90.0 | — | 83.5 |
| Coding | SWE-bench Pro | 62.1 | 58.4 | 60.6 | 58.6 | 69.2 |
| | FrontierSWE | 74.4 | 30.5 | — | 72.6 | 75.1 |
| Agentic | MCP-Atlas | 76.8 | 71.8 | 76.4 | 75.3 | 77.8 |
Key Highlights
GLM-5.2 is accessible via DeepInfra’s OpenAI-compatible API, making integration straightforward for developers familiar with standard LLM tooling.
Use your DeepInfra API key in the Authorization header:
Authorization: Bearer <YOUR_DEEPINFRA_API_KEY>
curl -X POST https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_API_KEY" \
  -d '{
    "model": "zai-org/GLM-5.2",
    "messages": [
      {
        "role": "user",
        "content": "Explain the concept of speculative decoding in 2 sentences."
      }
    ],
    "temperature": 0.7
  }'

| Parameter | Type | Description |
|---|---|---|
| model | String | Use "zai-org/GLM-5.2". |
| messages | Array | Conversation history objects. |
| response_format | Object | Set to {"type": "json_object"} for structured JSON. |
| tools | Array | Definitions for function calling. |
| temperature | Float | Controls randomness (0.0 to 2.0). |
| max_tokens | Integer | Maximum tokens to generate in the response. |
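The same request can be assembled in Python with only the standard library. This is a minimal sketch, not an official client: `build_payload` and `build_request` are hypothetical helper names, the `max_tokens` default is arbitrary, and the response is assumed to follow the usual OpenAI chat-completions shape (`choices[0].message.content`). The network call runs only if `DEEPINFRA_API_KEY` is set.

```python
import json
import os
import urllib.request

API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL = "zai-org/GLM-5.2"


def build_payload(prompt, temperature=0.7, max_tokens=256, json_mode=False):
    """Build a chat-completions body using the parameters from the table."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    if json_mode:
        # Request structured JSON output, as described for response_format.
        payload["response_format"] = {"type": "json_object"}
    return payload


def build_request(payload, api_key):
    """Wrap the payload in a POST request with the Bearer token header."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    key = os.environ.get("DEEPINFRA_API_KEY")
    if key:  # skip the network call when no key is configured
        prompt = "Explain the concept of speculative decoding in 2 sentences."
        req = build_request(build_payload(prompt), key)
        with urllib.request.urlopen(req, timeout=60) as resp:
            body = json.load(resp)
        # ASSUMPTION: OpenAI-style response shape.
        print(body["choices"][0]["message"]["content"])
```

Keeping payload construction separate from the HTTP call makes the request easy to inspect or unit-test without spending tokens.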
DeepInfra offers a flexible, pay-per-token pricing model for GLM-5.2, with options for both standard and prioritized workloads.
| Tier | Input | Cached Input | Output |
|---|---|---|---|
| Standard | $0.95 / 1M tokens | $0.18 / 1M tokens | $3.00 / 1M tokens |
| Priority (1.5×) | $1.425 / 1M tokens | $0.27 / 1M tokens | $4.50 / 1M tokens |
The Priority Tier, available at 1.5× the standard rate, is designed for workloads requiring higher priority and faster processing.
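To see how these rates translate into a bill, the sketch below estimates the cost of a single request. The rates are copied from the Standard row of the table; `estimate_cost` is a hypothetical helper, and it assumes cached and uncached input tokens are counted separately, which may not match exactly how usage is metered.

```python
from decimal import Decimal

# USD per 1M tokens, from the Standard tier in the pricing table.
STANDARD_RATES = {
    "input": Decimal("0.95"),
    "cached_input": Decimal("0.18"),
    "output": Decimal("3.00"),
}
PRIORITY_MULTIPLIER = Decimal("1.5")
MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens, output_tokens, cached_input_tokens=0,
                  priority=False):
    """Estimate the USD cost of one request.

    ASSUMPTION: `input_tokens` counts uncached input only, and cached
    input is billed separately at the cached rate.
    """
    cost = (
        STANDARD_RATES["input"] * input_tokens
        + STANDARD_RATES["cached_input"] * cached_input_tokens
        + STANDARD_RATES["output"] * output_tokens
    ) / MILLION
    return cost * PRIORITY_MULTIPLIER if priority else cost
```

For example, a Standard-tier request with 200,000 uncached input tokens and 4,000 output tokens comes to $0.19 + $0.012 = $0.202 under these assumptions.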
While standard API access is usage-based, users with high-throughput requirements can deploy Private Endpoints via the DeepInfra Dashboard for dedicated capacity.
GLM-5.2 combines a massive 1M-token context window with strong reasoning and coding capabilities, supported by architectural innovations like IndexShare and a flexible reasoning system. It provides developers with the efficiency and power needed for complex agentic and long-horizon tasks.
To begin building with GLM-5.2, visit the DeepInfra Dashboard to generate your API key and explore private deployment options.