DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

If you care about long-context reasoning but don’t want to lock yourself into a closed model, GLM 5.2 is worth attention for one simple reason: it pairs a 1M-token context window with open weights, MIT licensing, and a real provider market instead of a single take-it-or-leave-it endpoint. That makes it unusually relevant for teams doing cost tuning, provider failover, or self-hosting evaluations right now.
GLM 5.2 is a reasoning model from Z.ai, released on June 16, 2026. Across the research, it appears as GLM-5.2, GLM-5.2 (max), and provider slugs like z-ai/glm-5.2 or zai-org/GLM-5.2, but the core picture is consistent: this is a text-in, text-out Mixture-of-Experts model with 753B total parameters, 40B active parameters per token, and a 1M-token context window. It is open weight, commercially usable under the MIT license, and available both through hosted APIs and public model weights. The GLM-5.2 model card on DeepInfra also positions it as the first GLM release to bring long-horizon capability to a “solid” 1M-token context, with English and Chinese support and production-friendly features like JSON output and function calling.
What makes GLM 5.2 stand out is not that it is the cheapest model overall — it isn’t — but that it combines strong benchmark performance with unusually broad deployment flexibility. Artificial Analysis gives it an Intelligence Index score of 51, which it describes as well above the median for comparable open-weight models, while OpenRouter lists it as better than 88% of models compared on that index, better than 87% on coding, and better than 90% on agentic performance. DeepInfra’s benchmark tables also show it holding up credibly against models like Qwen3.7-Max, MiniMax M3, DeepSeek-V4-Pro, Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro across reasoning, coding, and agentic tasks. On raw economics, the picture is more nuanced: Artificial Analysis calls it expensive relative to open-weight peers, but OpenRouter shows meaningful provider variation, with 25 listed providers and base pricing ranging from $0.95 to $3.00 per million input tokens and $3.00 to $10.25 per million output tokens.
For developers and ML teams, that nuance is the real story. GLM 5.2 looks strongest when you need long-running agent workflows, project-scale coding, tool use, or document-heavy pipelines where a 1M-token window actually changes system design. It also looks more practical than many benchmark-only contenders because integration paths are already clear: OpenRouter exposes an OpenAI-compatible API with routing modes and reasoning controls, while DeepInfra offers both public and private endpoint deployment.
GLM 5.2 sits in an interesting pricing tier: base rates start at $0.95 per 1M input tokens and $3.00 per 1M output tokens, but the market spans 25 OpenRouter providers and climbs as high as $3.00 input and $10.25 output depending on who serves it. It is best suited for teams that need open-weight reasoning, long-context coding or agent workflows, and enough provider choice to optimize for cost, latency, or deployment control rather than accepting a single default.
| Best For | Provider | Why |
|---|---|---|
| RAG, document-heavy, or high-throughput use cases | DeepInfra Standard | Combines 1M-context access with $0.95/$3.00 base pricing, $0.18 cached input, JSON and function calling support, and one of the lowest effective input prices after caching at $0.298. |
| Production workloads needing more predictable service tiering | DeepInfra Priority | DeepInfra offers a Priority tier with a clear 1.5× pricing model, useful if you want the same model and interface with a higher service tier instead of changing providers. |
| Lowest price / cost-sensitive workloads | NovitaAI via OpenRouter | Lowest listed base price in the provider set at $0.95 input and $3.00 output, matching the floor of the market for teams optimizing headline token cost. |
| Easiest onboarding / fastest time-to-first-call | OpenRouter | OpenAI-compatible API for z-ai/glm-5.2, supporting routing modes, streaming, and reasoning controls across OpenAI, Anthropic Messages, and Responses-style integrations. |
| Managed model access with broad provider choice | OpenRouter Exacto or Balanced routing | Exposes 25 providers for the same model, so you can optimize for fixed-provider consistency with Exacto or let routing balance price and speed automatically. |
| Lowest effective input cost after caching | Decart via OpenRouter | Posts the best effective input price in the research at $0.268 per 1M tokens, driven by a 93.2% cache hit rate. |
| Fastest throughput-focused deployments | Fireworks Fast via OpenRouter | Listed among the top throughput providers at 117 tokens/sec average while staying close to floor pricing at $0.98 input and $3.08 output. |
| Lowest end-to-end latency for short completions | Wafer Fast via OpenRouter | Best reported 500-token end-to-end latency at 4.46s, though at a large premium ($3.00 input and $10.25 output). |
GLM 5.2 is still billed the usual way: by tokens in, tokens out, and sometimes cached tokens if your provider can reuse prompt prefixes. The catch is that this model has a 1M-token context window and a reasoning-oriented output style, so small pricing differences can turn into real money fast.
| Token type | What it is | Why it matters |
|---|---|---|
| Input tokens | Everything you send to the model: system prompt, user prompt, tool schemas, retrieved docs, chat history | This is where long-context apps get expensive. With GLM 5.2, it is easy to send more context than intended. |
| Output tokens | Everything the model generates back | Reasoning models can be wordy. GLM 5.2 was described as somewhat verbose in evaluation, so output spend can creep up even when input pricing looks fine. |
| Cached input tokens | Reused prompt tokens that the provider can bill at a lower cache-hit rate | This is the main lever for making repeated workflows affordable. Large static prefixes, agent instructions, and stable RAG scaffolding benefit the most. |
The good news is that GLM 5.2 has an actual provider market. The bad news is that the cheapest listed rate is not always the cheapest real rate once caching, latency, uptime, and output pricing start doing their thing.
| Provider / route | Base token pricing | Advantage | Downside |
|---|---|---|---|
| DeepInfra Standard | $0.95 input / $3.00 output / $0.18 cached input | Tied for the market floor on base price. Strong effective input cost after caching at $0.298/1M with 84.6% cache hit rate. Good fit for repeated prompts, RAG templates, and agent loops with stable prefixes. | Cheapest effective input does not automatically mean cheapest total request cost if your workload is output-heavy. |
| DeepInfra Priority | $1.425 input / $4.50 output / $0.27 cached input | Predictable premium: exactly 1.5× Standard pricing. Useful when you want a higher service tier without reworking integrations or routing logic. | You are paying a clear surcharge on every token. Only makes sense if service tiering matters more than raw token efficiency. |
| NovitaAI via OpenRouter | $0.95 / $3.00 / $0.18 | Lowest listed base price, tied with DeepInfra Standard. Strong effective input pricing at $0.347/1M and large real-world token share. | Not the best effective input cost after caching, even with low headline rates. Highly cacheable workloads may do better elsewhere. |
| Decart via OpenRouter | Base list price not the lowest; effective input is $0.268/1M | Best effective input price in the research because of a 93.2% cache hit rate. Attractive for apps with large repeated prompt prefixes. | The advantage depends on prompts actually hitting cache. If requests are highly unique, the effective edge can disappear fast. |
| Fireworks / Fireworks Fast via OpenRouter | ~$0.98 input / $3.08 output / $0.182 cache (Fireworks); Fireworks Fast in the $1.20–$1.40 input / $4.10–$4.40 output band | Close to floor pricing while posting some of the best throughput figures. Good when both token cost and speed matter. | Not the absolute cheapest on either base price or effective cached input. You trade a little token efficiency for speed. |
| Z.ai via OpenRouter | $1.20–$1.40 input / $4.10–$4.40 output | Solid cache hit rate at 86.62% and low tool-call error rate at 0.50%. Useful if stable tool use matters more than floor pricing. | More expensive than the lowest-cost providers on both input and output. Hard to justify for cost-sensitive bulk inference. |
| Wafer Fast via OpenRouter | $3.00 input / $10.25 output / $0.50 cache | Best reported end-to-end latency for short 500-token completions at 4.46s. | Steep token pricing. Output-heavy workloads get expensive very quickly. |
| OpenRouter Balanced / Nitro / Exacto routing | Depends on routed provider | One API with the ability to optimize for cost, speed, or provider consistency without changing app code. Exacto keeps token cost behavior more stable. | Routing convenience can obscure what you are actually paying unless you monitor provider selection and token mix carefully. |
If you want GLM 5.2 on an endpoint built for serious production use, DeepInfra is the power-user pick. It runs on bare-metal infrastructure, which matters because cutting out extra virtualization layers can help with both performance consistency and cost efficiency. DeepInfra also positions itself as typically 50–80% cheaper than major cloud competitors. In practice, it’s best suited for developers who want sharp token pricing, predictable deployment options, and a path from public API access to more controlled setups without changing models. If you want to compare GLM 5.2 against its predecessor before committing, the GLM-5.1 model overview is a good reference for how the family has evolved.
| Model Name | Best Use Case | Context Window | Input Price (per 1M tokens) | Output Price (per 1M tokens) |
|---|---|---|---|---|
| GLM 5.2 Standard | High-volume production inference with the lowest DeepInfra token cost | 1,048,576 tokens | $0.95 | $3.00 |
| GLM 5.2 Priority | Higher service tier for production workloads needing priority handling | 1,048,576 tokens | $1.425 | $4.50 |
Why this matters: On DeepInfra Standard, GLM 5.2 is $0.95 per 1M input tokens and $3.00 per 1M output tokens. Artificial Analysis reports the median across providers for this model at $1.40 input and $4.40 output, so DeepInfra is meaningfully below the market midpoint while still offering a Priority tier if you need it. For teams pushing a lot of long-context traffic, that gap adds up fast.
If GLM 5.2 is headed into a high-volume pipeline, DeepInfra is one of the cleanest places to pressure-test the economics before spend gets away from you. It’s especially compelling when you want low base pricing now and the option to scale into more controlled deployment later. For teams that also want to weigh the older generation, the GLM-5.1 pricing guide breaks down how the 5.1 provider market compares across the same axes.
Below are the kinds of workloads where DeepInfra is a particularly strong GLM 5.2 provider choice: long-context inputs, repeated prompt scaffolding, structured outputs, and production environments where you may want to start on a public endpoint and later move to a private deployment path without changing models.
A developer team is using GLM 5.2 for project-level code edits, architecture Q&A, and multi-file refactors. Each request includes a large, mostly stable prompt prefix: coding standards, repo conventions, tool schemas, and shared instructions. That is exactly the kind of pattern where DeepInfra Standard looks good: low base token pricing, cached input support at $0.18/1M, and a strong observed effective input price of $0.298/1M. Teams currently running similar workloads on the GLM-5.1 demo endpoint will find the migration path to GLM 5.2 straightforward since the interface conventions are consistent.
| Metric | Value |
|---|---|
| Volume | 10,000 requests/month |
| Model | GLM 5.2 |
| Provider | DeepInfra Standard |
| Input Tokens | 500M/month |
| Output Tokens | 100M/month |
| Monthly Cost | $775 |
Cost math:
Same workload at the Artificial Analysis median provider price ($1.40 input / $4.40 output): $1,140/month — $365/month more than DeepInfra Standard.
A team is feeding long manuals, policy docs, or customer history into GLM 5.2 for grounded answers and structured extraction. This is a classic use case for the model’s 1,048,576-token context window, but it can become expensive quickly if input pricing is mediocre. DeepInfra is appealing here because it sits at the market floor on listed base pricing and supports cached input for repeated retrieval scaffolding.
| Metric | Value |
|---|---|
| Volume | 2,000 jobs/month |
| Model | GLM 5.2 |
| Provider | DeepInfra Standard |
| Input Tokens | 2B/month |
| Output Tokens | 200M/month |
| Monthly Cost | $2,500 |
Cost math:
Same workload on Wafer Fast ($3.00 input / $10.25 output): $8,050/month — $5,550/month more than DeepInfra Standard.
Some teams do not want to juggle multi-provider routing just to get a stronger service tier. They want the same model, same interface, and a simple premium for higher-priority handling. This is one of DeepInfra’s cleaner advantages: Priority pricing is exactly 1.5× Standard, which makes planning straightforward. If latency under load is a key concern, the GLM-5.1 API benchmarks for latency and throughput provide a useful baseline for setting expectations on the GLM family.
| Metric | Value |
|---|---|
| Volume | 5,000 workflows/month |
| Model | GLM 5.2 |
| Provider | DeepInfra Priority |
| Input Tokens | 750M/month |
| Output Tokens | 150M/month |
| Monthly Cost | $1,743.75 |
Cost math:
Same workload on Wafer Fast ($3.00 input / $10.25 output): $3,787.50/month — $2,043.75/month more than DeepInfra Priority.
Suppose you are building an API that turns messy text into validated JSON: invoices, forms, tickets, case notes, or multilingual business docs. GLM 5.2’s JSON output and function calling support matter here, but so does not overpaying for output. DeepInfra stands out because it keeps output at $3.00/1M, while some providers push far higher. If you also have voice-based extraction workloads, the GLM-5.2 voice endpoint is worth a look as a complementary surface for audio-first pipelines.
| Metric | Value |
|---|---|
| Volume | 1,000,000 documents/month |
| Model | GLM 5.2 |
| Provider | DeepInfra Standard |
| Input Tokens | 300M/month |
| Output Tokens | 300M/month |
| Monthly Cost | $1,185 |
Cost math:
Same workload at the Artificial Analysis median price ($1.40 input / $4.40 output): $1,740/month — $555/month more than DeepInfra Standard.
This one is less about the list price and more about how DeepInfra performs when your prompt shape is repetitive. OpenRouter’s provider data shows DeepInfra at $0.298 effective input price after caching, with an 84.6% cache hit rate. If your internal agent reuses the same long prompt skeleton across thousands of calls, DeepInfra gets even more compelling. For teams who also want to keep an eye on broader open-weight options, the original GLM-5 demo is a useful reference point for how this family handles long-context reasoning at the base level.
| Metric | Value |
|---|---|
| Volume | 50,000 agent turns/month |
| Model | GLM 5.2 |
| Provider | DeepInfra Standard |
| Input Tokens | 1B/month |
| Output Tokens | 100M/month |
| Monthly Cost | $1,250 |
Cost math:
Same workload at the Artificial Analysis median price ($1.40 input / $4.40 output): $1,840/month — $590/month more than DeepInfra Standard.
For teams with strong cache reuse, DeepInfra’s observed effective input price of $0.298/1M is one of the clearest signs that the provider is not just cheap on paper — it is well-positioned for real repeated-workload economics too. Voice-driven assistants exploring similar territory can also look at the GLM-5.1 voice endpoint for comparison.
The provider decision for GLM 5.2 is not really about picking the cheapest number on a comparison table. It is about matching your workload shape — input volume, output verbosity, cache reuse, and service tier requirements — to a provider whose actual economics hold up under your real traffic patterns. A model with a 1M-token context window and reasoning-oriented output can look affordable at the headline rate and expensive in practice if you are not thinking about the full token mix.
The two criteria that separate good GLM 5.2 deployments from expensive ones are output pricing discipline and prompt caching strategy. The spread from $3.00 to $10.25 per million output tokens across providers is wide enough to dominate your bill on any verbose workload, and the difference between a good cache hit rate and a poor one can shift your effective input cost by 60–70%. If your application has stable system prompts, shared retrieval scaffolding, or repeated agent instructions, that cached input rate at $0.18 per million tokens on DeepInfra Standard is a core part of the cost model, not a footnote. API compatibility matters too: JSON output and function calling support are table stakes for most production pipelines, and not every provider exposes them cleanly.
DeepInfra sits at the market floor on base pricing while also offering a clear Priority tier for teams that want higher service levels without reworking their integration. When you are ready to wire it into your application, the GLM-5.2 API reference covers everything you need to make your first call. And if you want to compare GLM 5.2 against other models in the same family or across the broader catalog, DeepInfra’s full model listing is a practical starting point for scoping alternatives.
DeepSeek V4 Pro Pricing Guide 2026: Pricing, Providers & Cost Comparison<p>DeepSeek V4 Pro matters because it pushes two levers developers actually care about at the same time: open-weight availability and a very competitive provider market. As of the research here, DeepSeek V4 Pro Max is tracked across six API providers, and five of them cluster at the same blended price of $2.17 per 1M tokens […]</p>
Qwen API Pricing Guide 2026: Max Performance on a Budget<p>If you have been following the AI leaderboards lately, you have likely noticed a new name constantly trading blows with GPT-4o and Claude 3.5 Sonnet: Qwen. Developed by Alibaba Cloud, the Qwen model family (specifically Qwen 2.5 and Qwen 3) has exploded in popularity for one simple reason: unbeatable price-to-performance. In 2025, Qwen is widely […]</p>
How to OpenAI Whisper with per-sentence and per-word timestamp segmentation using DeepInfraWhisper is a Speech-To-Text model from OpenAI.© 2026 DeepInfra. All rights reserved.