
Deploying the GLM-5.2 (max) Mixture-of-Experts model (753B total parameters, roughly 40B active per token, and a 1M-token context window) places heavy demands on serving infrastructure, and providers differ widely in how well they meet them. This guide compares the top providers by throughput, latency, pricing, and quantization precision.

| Provider | Why it’s a strong option for GLM-5.2 (max) | Best-fit use cases | What to confirm before choosing |
|---|---|---|---|
| DeepInfra | Strong candidate for hosting open-weights models with a production API surface, which is useful when you want to deploy an MIT-licensed, Hugging Face-available MoE model without self-hosting. This model’s throughput (115.2 tok/s median across providers) and 1M-token context are a good match for providers optimized for scalable inference. | Long-context applications (1M tokens), fast interactive experiences (high tok/s), reasoning-heavy workloads where you still want open-weights flexibility | Whether DeepInfra currently serves GLM-5.2 (max), the exact input/output/cache pricing it offers, TTFT and throughput on its infrastructure, rate limits, max output token limits, and any caching write/storage fees |
| Other providers (comparison) | Artificial Analysis reports pricing and performance as first-party API (if available) or median across providers when first-party isn’t available; provider-to-provider variance can materially change cost and latency. | Cost-sensitive deployments (this model is pricey vs peers), latency-sensitive apps (TTFT matters), workloads that can exploit caching | Check per-provider price, TTFT, throughput, caching policy, and availability |

| API Provider | Output Speed (t/s) | Latency (TTFT)* | Blended Price (per 1M tokens) | Quantization / Precision | Best For |
|---|---|---|---|---|---|
| DeepInfra | Top Tier | Top Tier | $0.80 | FP4 | Overall Value |
| Fireworks | 314.9 | 8.14s | $0.90 | FP8 | Speed |
| Baseten | 277.8 | 8.93s | Undisclosed | Undisclosed | High Throughput |
| Databricks | 240.9 | 9.23s | Undisclosed | Undisclosed | Enterprise |
| GMI | Undisclosed | Undisclosed | $0.72 | FP8 | Budget |
| CoreWeave | 165.9 | 13.43s | $0.90 | Undisclosed | Infrastructure |

*Note: For reasoning models like GLM-5.2 (max), Time to First Token (TTFT) includes the model’s internal “thinking” time prior to outputting the final answer.
GLM-5.2 is Z.ai’s latest flagship model, built for coding, reasoning, and tool-driven agentic workloads. Released on June 13, 2026, it succeeds GLM-5.1 in the GLM-5 family and represents a significant evolution from the original GLM-5 (744B parameters) released in February 2026.
Z.ai — formerly Zhipu AI — became a publicly traded foundation model company with its Hong Kong IPO in January 2026. The company, founded in 2019 as a spin-off from Tsinghua University, has established itself as a leader in open-source AI research with a consistent release cadence.
GLM-5.2 (max) scores 51 on the Artificial Analysis Intelligence Index, placing it ahead of MiniMax-M3 (44), DeepSeek V4 Pro (44), and Kimi K2.6 (43). The model was reportedly trained on Huawei Ascend chips using the MindSpore framework — a notable detail given Z.ai’s placement on the U.S. Entity List, which restricts access to NVIDIA H100/H200 GPUs.
DeepInfra is the overall recommended API provider for GLM-5.2 (max). Serving a 753B-parameter MoE model is difficult mainly because of memory: at 16 bits per weight, the weights alone occupy about 1.5 TB. DeepInfra serves the model with FP4 (4-bit floating point) quantization, which shrinks that footprint to roughly a quarter.
Why FP4 matters: NVIDIA’s Blackwell architecture (B200 GPUs) includes tensor cores with native FP4 support, so FP4 matrix math runs in hardware rather than being emulated. Compared with BF16 inference, FP4 stores each weight in a quarter of the bits, which cuts weight memory and memory-bandwidth pressure and can raise throughput; the accuracy lost to quantization is generally easier to recover on larger MoE architectures like GLM-5.2 (max). The memory saved leaves more room for KV cache, which helps DeepInfra support the model’s full 1,048,576-token context window at inference speeds that rival FP8 deployments.
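The memory argument can be checked with simple arithmetic. The sketch below is a rough estimate, not a description of any provider's deployment: it counts only the raw weight bits for the 753B parameters quoted in this guide and ignores quantization scale factors, activations, and KV cache, all of which add overhead in practice.

```python
# Back-of-the-envelope weight footprint for a 753B-parameter model.
# Assumption: every weight is stored at the stated precision, with no
# per-block scale factors or other metadata.
TOTAL_PARAMS = 753e9  # total parameter count quoted in this guide

BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "FP4": 4}


def weight_footprint_gb(total_params: float, bits_per_weight: int) -> float:
    """Return raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


footprints = {
    name: weight_footprint_gb(TOTAL_PARAMS, bits)
    for name, bits in BITS_PER_WEIGHT.items()
}

for name, gb in footprints.items():
    print(f"{name}: {gb:,.1f} GB")
```

Under these assumptions the weights shrink from about 1.5 TB in BF16 to under 0.4 TB in FP4, which is the headroom that can then be spent on long-context KV cache.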
At a blended price of $0.80 per 1M tokens, DeepInfra strikes a strong balance between cost-efficiency, throughput, and memory optimization for developers building agentic workflows that require extensive reasoning and long-horizon context.
Fireworks is the throughput leader for GLM-5.2 (max), achieving 314.9 tokens per second (t/s) — the fastest provider benchmarked for this model. It also reports a Time to First Token (TTFT) of 8.14 seconds, which accounts for the model’s reasoning phase.
Priced at a blended rate of $0.90 per 1M tokens and utilizing FP8 precision, Fireworks is suited for applications where rapid token generation and low end-to-end response times are priorities. FP8 typically delivers meaningful latency improvements compared to FP16 while maintaining near-lossless output quality.
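To see how TTFT and output speed combine into an end-to-end response time, the following sketch uses the simple model "total time = TTFT + output tokens / output speed" with the figures from the comparison table. The 1,000-token answer length is a hypothetical value chosen for illustration, and the model assumes TTFT and output speed do not vary with response length, which real deployments will not match exactly.

```python
# Rough end-to-end latency estimate from published TTFT and output speed.
# Figures come from the comparison table; providers without both numbers
# are omitted.
PROVIDERS = {
    "Fireworks": {"ttft_s": 8.14, "tokens_per_s": 314.9},
    "Baseten": {"ttft_s": 8.93, "tokens_per_s": 277.8},
    "Databricks": {"ttft_s": 9.23, "tokens_per_s": 240.9},
    "CoreWeave": {"ttft_s": 13.43, "tokens_per_s": 165.9},
}

OUTPUT_TOKENS = 1000  # hypothetical answer length, for illustration only


def estimated_response_time_s(ttft_s: float, tokens_per_s: float,
                              output_tokens: int) -> float:
    """Time to first token plus time to stream the remaining output."""
    return ttft_s + output_tokens / tokens_per_s


estimates = {
    name: estimated_response_time_s(p["ttft_s"], p["tokens_per_s"], OUTPUT_TOKENS)
    for name, p in PROVIDERS.items()
}

# Print providers from fastest to slowest estimated response.
for name, seconds in sorted(estimates.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.2f} s")
```

For short answers the TTFT term dominates, so differences in output speed matter most for long generations.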
For cost optimization, GMI leads with a blended price of $0.72 per 1M tokens — the lowest among benchmarked providers. GMI uses FP8 quantization, which reduces memory bandwidth requirements compared to standard FP16 deployments while retaining output quality.
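GMI’s pricing makes it attractive for batch processing or high-volume, automated agentic tasks where latency is not the primary constraint. Its output speed and TTFT are undisclosed in the comparison table, so benchmark them before routing latency-sensitive traffic to it.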
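A quick way to judge whether the price gap matters is to multiply it out for an expected volume. The sketch below uses the blended prices listed in this guide; the 500M tokens per month is a hypothetical workload, and real bills depend on each provider's separate input, output, and cache rates rather than a single blended figure.

```python
# Monthly cost estimate from blended prices (USD per 1M tokens).
# Prices come from the comparison table; the volume is an assumption.
BLENDED_PRICE_PER_M = {
    "GMI": 0.72,
    "DeepInfra": 0.80,
    "Fireworks": 0.90,
    "CoreWeave": 0.90,
}

MONTHLY_TOKENS = 500_000_000  # hypothetical batch workload


def monthly_cost_usd(price_per_million: float, tokens: int) -> float:
    """Blended price applied uniformly to all tokens."""
    return price_per_million * tokens / 1_000_000


costs = {
    name: monthly_cost_usd(price, MONTHLY_TOKENS)
    for name, price in BLENDED_PRICE_PER_M.items()
}

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:,.2f} per month")
```

At this assumed volume the spread between the cheapest and most expensive listed price is under $100 per month, so throughput and latency may weigh more heavily than price for smaller workloads.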
Baseten delivers an output speed of 277.8 t/s and a TTFT of 8.93 seconds. Serving a model that activates 40B parameters per token requires robust GPU orchestration, and these figures suggest Baseten’s infrastructure handles it well. Its blended price and serving precision are undisclosed, so confirm both before committing.
Baseten is a reasonable endpoint for developers who need sustained, high throughput for real-time coding assistants or complex multi-step reasoning applications.
Databricks offers an endpoint aimed at enterprise users, with an output speed of 240.9 t/s and a TTFT of 9.23 seconds.
For teams already embedded in the Databricks ecosystem, this provider offers a way to integrate GLM-5.2 (max)’s reasoning capabilities and context window into existing data pipelines and software engineering workflows.
Z.ai’s native API provides the baseline experience for GLM-5.2 (max). Pricing is explicit: $1.40 per 1M input tokens, $4.40 per 1M output tokens, and a discounted $0.26 per 1M cached tokens. The blended price is roughly $0.90 per 1M tokens; that figure depends on the assumed mix of cached, uncached, and output tokens.
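A blended price is a weighted average of the per-token rates, so it moves a lot with the traffic mix. The sketch below applies the three rates quoted above to two hypothetical mixes. Neither mix is stated in the source: the first is a common 3:1 input-to-output split with no caching, and the second is a cache-heavy split chosen only to show that a mix of this kind lands near $0.90.

```python
# Blended price (USD per 1M tokens) as a weighted average of the rates
# quoted for Z.ai's first-party API. The token mixes are assumptions.
INPUT_PRICE = 1.40   # uncached input tokens
OUTPUT_PRICE = 4.40  # output tokens
CACHED_PRICE = 0.26  # cached input tokens


def blended_price(cached_share: float, input_share: float,
                  output_share: float) -> float:
    """Weighted average price; the three shares must sum to 1."""
    total = cached_share + input_share + output_share
    if abs(total - 1.0) > 1e-9:
        raise ValueError("token shares must sum to 1")
    return (cached_share * CACHED_PRICE
            + input_share * INPUT_PRICE
            + output_share * OUTPUT_PRICE)


# Hypothetical mix 1: 3:1 input-to-output, no cache hits.
no_cache_3_to_1 = blended_price(0.0, 0.75, 0.25)

# Hypothetical mix 2: 70% cached input, 20% uncached input, 10% output.
cache_heavy = blended_price(0.70, 0.20, 0.10)

print(f"3:1 without caching: ${no_cache_3_to_1:.2f} per 1M tokens")
print(f"cache-heavy mix:     ${cache_heavy:.2f} per 1M tokens")
```

Because the two mixes differ by more than a factor of two, it is worth recomputing the blended figure with your own cache hit rate before comparing providers.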
The first-party API provides Anthropic-compatible endpoints (https://api.z.ai/api/coding/paas/v4), making it straightforward to integrate into existing tools like Claude Code or Cline. The median output speed across providers is 115.2 t/s, so raw speed is not the main reason to choose Z.ai’s native endpoint; its advantages are day-zero feature support and native integration of the model’s “high” and “max” reasoning effort modes.
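Understanding quantization is useful when selecting a provider for large MoE models like GLM-5.2 (max): lower-precision formats such as FP8 and FP4 reduce memory footprint and memory-bandwidth requirements relative to FP16/BF16, at the cost of some accuracy that providers must recover.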
Providers serving GLM-5.2 (max) must also optimize for KV cache scaling to handle the 1M-token context window, typically using techniques like continuous batching and efficient memory management via inference frameworks such as vLLM or TensorRT-LLM.
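To illustrate why KV cache dominates at long context, the sketch below applies the standard sizing formula for conventional multi-head or grouped-query attention: two tensors (keys and values) per layer, each of size KV heads × head dimension per token. The layer count, KV head count, and head dimension are hypothetical placeholders, not GLM-5.2's published configuration, and architectures that compress or sparsify the cache would need far less.

```python
# KV cache sizing for standard (multi-head / grouped-query) attention.
# All architecture values below are HYPOTHETICAL and chosen only to
# illustrate the scaling; they are not GLM-5.2's real configuration.
NUM_LAYERS = 80          # hypothetical
NUM_KV_HEADS = 8         # hypothetical
HEAD_DIM = 128           # hypothetical
BYTES_PER_ELEMENT = 2    # 16-bit cache entries
CONTEXT_TOKENS = 1_048_576  # full context window quoted in this guide


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   bytes_per_element: int, tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * tokens


total_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM,
                             BYTES_PER_ELEMENT, CONTEXT_TOKENS)
total_gib = total_bytes / 2**30

print(f"KV cache for one full-context sequence: {total_gib:.0f} GiB")
```

The cache grows linearly with context length and with the number of concurrent sequences, which is why paged memory management and continuous batching matter for 1M-token serving.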
Deploying a large Mixture of Experts model like GLM-5.2 (max) requires API providers to push the boundaries of hardware optimization and memory bandwidth. The model’s 1M-token context window, 40B active parameters per token, and intensive reasoning capabilities demand infrastructure that can balance throughput, latency, and cost.
By using FP4 quantization on modern GPU architectures, DeepInfra (deepinfra.com) balances the model’s large context window, fast token generation, and a competitive $0.80 blended price point. Whether you are building long-horizon coding agents or complex reasoning systems, DeepInfra provides a technically capable infrastructure option for GLM-5.2 (max).
© 2026 DeepInfra. All rights reserved.