GLM-5.1 API Benchmarks: Latency, Throughput & Cost

Published on 2026.05.25 by DeepInfra

Z.ai’s GLM-5.1 is an April 2026 open-weight reasoning model built for long-horizon agentic engineering — and accessing it effectively means navigating a real spread of provider options. Across 10 benchmarked API providers, blended pricing ranges from $0.74 to $1.70 per 1M tokens, output speed from 33.8 to 175.2 t/s, and the fastest provider is 5.2x quicker than the slowest. For teams moving from prototype to production, those differences determine whether this model fits a budget or quietly breaks a latency target. This breakdown covers the full provider landscape — performance metrics, pricing structures, and how to match them to your workload.

GLM-5.1 (Reasoning) API Review Summary

10 API providers benchmarked: DeepInfra (FP8), Fireworks, Wafer, FriendliAI, SiliconFlow, Novita, Parasail (FP8), Together.ai, Nebius (FP8 Base), CoreWeave
Benchmarks are median (P50), using a 7:2:1 cache-input-output blended ratio to reflect realistic production usage
Lowest blended price: DeepInfra (FP8) $0.74, Wafer $0.86, SiliconFlow $0.90 per 1M tokens
Lowest input token price: DeepInfra (FP8) $1.05, Nebius (FP8 Base) $1.40, Together.ai $1.40 per 1M
Lowest output token price: DeepInfra (FP8) $3.50, Nebius (FP8 Base) $4.40, Together.ai $4.40 per 1M
Fastest TTFT: Fireworks and DeepInfra (FP8) tied at 0.94s, FriendliAI 1.04s
Fastest output speed: Fireworks 175.2 t/s, Wafer 160.4 t/s, FriendliAI 128 t/s; a 5.2x spread from fastest to slowest (Parasail at 33.8 t/s)
Pricing spread: 2.3x across providers — DeepInfra $0.74 vs Nebius $1.70 per 1M blended
Feature coverage: All 10 providers support function calling; 9 of 10 support JSON mode (SiliconFlow does not)
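
The two spread figures in this summary follow directly from the endpoint values quoted in the list. A quick arithmetic check, using only numbers from the summary (the variable names are illustrative):

```python
# Endpoint values quoted in the summary list.
lowest_blended = 0.74    # DeepInfra (FP8), $ per 1M tokens
highest_blended = 1.70   # Nebius (FP8 Base), $ per 1M tokens
fastest_tps = 175.2      # Fireworks, tokens per second
slowest_tps = 33.8       # Parasail (FP8), tokens per second

price_spread = highest_blended / lowest_blended
speed_spread = fastest_tps / slowest_tps

print(f"pricing spread: {price_spread:.1f}x")
print(f"speed spread:   {speed_spread:.1f}x")
```

Rounded to one decimal place, these come out at 2.3x and 5.2x.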

GLM-5.1 — Best APIs

Provider	Why it’s a best pick	Blended ($/1M)	Input ($/1M)	Output ($/1M)	Speed (t/s)	TTFT (s)	Context (tokens)	JSON mode	Function calling
DeepInfra (FP8)	Best cost + top-tier latency; strong for scale and budget-sensitive workloads	0.74	1.05	3.50	35	0.94	203k	Yes	Yes
Fireworks	Best raw performance; fastest output speed	0.90	—	—	175.2	0.94	203k	Yes	Yes
Wafer	Strong speed + low blended price	0.86	—	—	160.4	1.11	203k	Yes	Yes
FriendliAI	Balanced speed + low latency + competitive blended price	0.90	—	—	128	1.04	203k	Yes	Yes
SiliconFlow	Low blended price; note: no JSON mode	0.90	—	—	50	4.47	205k	No	Yes

About GLM-5.1

GLM-5.1 is Z.ai’s next-generation flagship model for agentic engineering, released on April 7, 2026. It is a post-training refinement of GLM-5, specifically optimized for coding and long-horizon autonomous workflows. The model uses a 754-billion parameter Mixture-of-Experts (MoE) architecture with 40 billion active parameters per token, a 203K context window, and up to 131K output tokens. Weights are available on Hugging Face under the MIT license.

On SWE-Bench Pro, GLM-5.1 scores 58.4, ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). On AIME 2026, it scores 95.3. The model’s core design principle is sustained improvement across long agentic runs — unlike predecessor models that plateau after initial gains, GLM-5.1 is built to keep improving across hundreds of rounds and thousands of tool calls. For a deeper look at its predecessor, see the GLM-5 API benchmarks.

Z.ai demonstrated this capability with long autonomous runs: GLM-5.1 built a complete Linux desktop environment over 8 hours, and across 655 iterations it increased vector database query throughput to 6.9x the initial baseline. GLM-5.1 is available on DeepInfra at deepinfra.com/zai-org/GLM-5.1.

Quick Verdict: Which GLM-5.1 Provider is Best?

Based on benchmarks across 10 tracked providers, DeepInfra is the recommended API for production GLM-5.1 deployment. It offers the lowest blended price ($0.74/1M), lowest input price ($1.05/1M), lowest output price ($3.50/1M), and ties Fireworks for fastest TTFT at 0.94s. For applications requiring maximum raw throughput, Fireworks leads at 175.2 t/s. For a balanced alternative at slightly lower cost than Fireworks, Wafer ($0.86/1M blended, 160.4 t/s) is the strongest option.

Overall Recommendation: DeepInfra (FP8)

DeepInfra offers the best overall value for GLM-5.1 — lowest cost across every token metric, top-tier latency, and full feature support.

Output Speed: 35 t/s
Time to First Token: 0.94s (tied #1)
Blended Price: $0.74 / 1M tokens (lowest)
Input Price: $1.05 / 1M tokens (lowest)
Output Price: $3.50 / 1M tokens (lowest)
Cached Input Price: $0.205 / 1M tokens
Context Window: 203k tokens
API Features: JSON Mode + Function Calling — both supported
Deployment: Public and private endpoints available

With a 2.3x pricing spread across the benchmark set, provider choice is a real economic decision for GLM-5.1. DeepInfra’s FP8 deployment leads on cost and ties for the lowest TTFT at the same time, the strongest combination in the set. For the long-context, agentic workloads GLM-5.1 is designed for, DeepInfra’s cached input pricing at $0.205/1M is also the only explicitly listed cache rate among benchmarked providers, making it the most practical option for agent loops, RAG pipelines, and any workload that resends stable prompt prefixes. For context on how an earlier GLM model compares against DeepSeek V3.2 on cost and capability, see the GLM-4.6 vs DeepSeek-V3.2 breakdown.

Provider Analyses

1. Fireworks — Best for Raw Throughput

Output Speed: 175.2 t/s (fastest)
Time to First Token: 0.94s (tied #1)
Blended Price: $0.90 / 1M tokens
Context Window: 203k tokens
API Features: JSON Mode, Function Calling

Fireworks leads all 10 providers on raw output speed at 175.2 t/s, a 5.2x advantage over the slowest provider in the set (Parasail at 33.8 t/s). It also ties DeepInfra on TTFT at 0.94s and leads on time to first answer token at 22.58s. The trade-off is cost: at $0.90/1M blended, it is about 22% more expensive than DeepInfra ($0.74/1M). For throughput-critical applications or any workload where generation speed directly affects user experience, Fireworks is the right choice.

2. Wafer — Best Balanced Alternative

Output Speed: 160.4 t/s (#2)
Time to First Token: 1.11s
Blended Price: $0.86 / 1M tokens (#2 lowest)
Context Window: 203k tokens
API Features: JSON Mode, Function Calling

Wafer is the strongest all-around alternative to DeepInfra, ranking #2 on blended price, #2 on output speed (160.4 t/s), and #2 on time to first answer token (24.74s). For teams that want strong performance without paying the Fireworks speed premium or accepting DeepInfra’s lower output throughput, Wafer occupies the clearest middle ground in the benchmark set.

3. FriendliAI — Balanced Across All Metrics

Output Speed: 128 t/s (#3)
Time to First Token: 1.04s (#3)
Blended Price: $0.90 / 1M tokens
Context Window: 203k tokens
API Features: JSON Mode, Function Calling

FriendliAI offers a consistent performance profile across all metrics — no single area stands out, but no obvious weaknesses either. At $0.90/1M blended with 128 t/s output and 1.04s TTFT, it is a reliable backup provider for intelligent routing setups where a consistent mid-tier option is needed.

4. SiliconFlow — Low Price, Higher Latency, No JSON Mode

Output Speed: 50 t/s
Time to First Token: 4.47s (highest in the set)
Blended Price: $0.90 / 1M tokens
Context Window: 205k tokens
API Features: Function Calling only — JSON mode not supported

SiliconFlow is the only provider in the benchmark set without JSON mode support. For structured output pipelines or agentic workflows that rely on reliable JSON responses, this creates downstream friction — retries, prompt padding, and format-correction logic that inflates real token counts. Its 4.47s TTFT is also the highest in the set by a significant margin. The competitive blended price needs to be weighed against these operational constraints before committing it to production.
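
To make the retry overhead concrete, here is a minimal sketch of how failed structured-output attempts inflate effective cost. It assumes each attempt fails independently with a fixed probability and that a failed attempt costs as much as a successful one. The function name and the 10% failure rate are hypothetical, chosen only for illustration; they are not measured values for any provider.

```python
def effective_price(blended_usd_per_1m: float, failure_rate: float) -> float:
    """Expected cost per 1M useful tokens when failed responses are retried.

    Assumes each attempt fails independently with probability `failure_rate`
    and that a failed attempt costs as much as a successful one.
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    expected_attempts = 1.0 / (1.0 - failure_rate)
    return blended_usd_per_1m * expected_attempts

# HYPOTHETICAL failure rate for illustration only; not a measured value.
print(f"${effective_price(0.90, failure_rate=0.10):.2f} per 1M")
```

Under those assumptions, a $0.90/1M blended price behaves like $1.00/1M. Measuring your own parse-failure rate is the only way to get a real figure.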

5. Remaining Providers

Novita, Parasail (FP8), Together.ai, Nebius (FP8 Base), and CoreWeave round out the provider set. Together.ai and Nebius both list $1.40/1M input and $4.40/1M output, above DeepInfra’s $1.05 and $3.50, and Nebius has the highest blended price in the benchmark at $1.70/1M. Parasail has the slowest measured output speed at 33.8 t/s. All five support function calling and JSON mode. For teams building intelligent routing, these providers can serve as fallbacks, but none presents a clear cost or performance advantage over the five options above.

Technical Deep-Dive: What Developers Need to Know

1. Blended Pricing Uses a 7:2:1 Cache-Input-Output Ratio

Artificial Analysis benchmarks GLM-5.1 using a 7:2:1 cache-input-output blended ratio, meaning 70% of tokens are priced as cached input, 20% as uncached input, and 10% as output. That weighting is a reminder that cache behavior is central to most production agentic workloads, not an edge case. For teams building with GLM-5.1’s long-horizon design in mind (repeated tool schemas, stable system prompts, persistent agent state), the cached rate matters more than the headline blended figure. DeepInfra is the only provider in this set with an explicitly listed cached input rate ($0.205/1M), which directly maps to cost savings on those patterns. For a practical breakdown of how this plays out across real workloads, see the GLM-5.1 pricing guide.
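
A minimal sketch of the 7:2:1 blend as a weighted average, using rates quoted in this article. The function name is illustrative, and the Nebius line rests on a loudly stated assumption: no cached rate is listed for Nebius here, so the sketch bills cache hits at the full input rate.

```python
def blended_price(cache: float, inp: float, out: float,
                  weights: tuple[float, float, float] = (7, 2, 1)) -> float:
    """Weighted average of per-1M-token prices using cache:input:output weights."""
    w_cache, w_in, w_out = weights
    total = w_cache + w_in + w_out
    return (w_cache * cache + w_in * inp + w_out * out) / total

# DeepInfra (FP8) rates as listed in this article.
deepinfra = blended_price(cache=0.205, inp=1.05, out=3.50)

# Nebius (FP8 Base): no cached rate is listed here, so this ASSUMES
# cache hits are billed at the full input rate.
nebius = blended_price(cache=1.40, inp=1.40, out=4.40)

print(f"DeepInfra: ${deepinfra:.2f} per 1M")
print(f"Nebius:    ${nebius:.2f} per 1M")
```

Under that assumption the Nebius figure comes out at the published $1.70. The DeepInfra figure comes out near $0.70, slightly below the published $0.74, which suggests the benchmark’s inputs may differ from the rates listed here. Treat the function as a way to re-blend prices for your own traffic mix rather than as a reproduction of the benchmark.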

2. TTFT vs. Time to First Answer Token

GLM-5.1 is a reasoning model with thinking mode enabled by default. TTFT measures time to the first token (often a thinking/reasoning token), while time to first answer token measures when the model begins generating the actual response. For user-facing applications, the latter is the number that matters. Fireworks leads on time to first answer token at 22.58s, ahead of Wafer in second place at 24.74s. When evaluating latency for interactive use cases, make sure you are measuring the right metric.
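
The distinction can be sketched as a small function over a token stream. The `TokenEvent` type and its fields are hypothetical: how a given provider marks reasoning tokens versus answer tokens in its streaming response varies, so adapt the tagging to whatever your client actually returns. The sample stream is synthetic, not benchmark data.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class TokenEvent:
    elapsed_s: float    # seconds since the request was sent
    is_reasoning: bool  # True for thinking tokens, False for answer tokens

def first_token_latencies(
    events: Iterable[TokenEvent],
) -> tuple[Optional[float], Optional[float]]:
    """Return (TTFT, time to first answer token) from events in arrival order."""
    ttft = None
    first_answer = None
    for ev in events:
        if ttft is None:
            ttft = ev.elapsed_s          # first token of any kind
        if not ev.is_reasoning:
            first_answer = ev.elapsed_s  # first token of the actual answer
            break
    return ttft, first_answer

# Synthetic stream for illustration; not benchmark data.
stream = [
    TokenEvent(0.9, True),
    TokenEvent(1.0, True),
    TokenEvent(20.0, False),
    TokenEvent(20.1, False),
]
print(first_token_latencies(stream))
```

For this synthetic stream the two values are 0.9s and 20.0s, which is the kind of gap a reasoning model can open between the two metrics.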

3. FP8 Quantization and What It Means

DeepInfra serves GLM-5.1 in FP8 quantization. The benchmarked pricing reflects FP8 serving; the model page also lists FP4 pricing. Confirm which serving tier you are buying before locking in cost assumptions. FP8 reduces memory requirements and inference cost with minimal impact on output quality for most production workloads — but for edge-case mathematical reasoning or complex coding tasks, it is worth running evals against your specific prompt distribution.

4. SiliconFlow JSON Mode Gap

All 10 providers support function calling, but SiliconFlow is the only one in the set without JSON mode. For agentic pipelines that rely on structured outputs — tool call responses, retrieval schemas, or any workflow that parses model output programmatically — the absence of JSON mode creates real operational friction. Check your structured output requirements before routing to SiliconFlow.
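
One way to encode that check is a small routing filter that drops providers failing a hard requirement before comparing price. This is a sketch only: the `Provider` type and `pick_provider` function are hypothetical, and the figures are copied from the "Best APIs" table in this article rather than fetched from any live API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provider:
    name: str
    blended_usd_per_1m: float
    output_tps: float
    ttft_s: float
    json_mode: bool

# Figures from the "Best APIs" table in this article.
PROVIDERS = [
    Provider("DeepInfra (FP8)", 0.74, 35.0, 0.94, True),
    Provider("Fireworks", 0.90, 175.2, 0.94, True),
    Provider("Wafer", 0.86, 160.4, 1.11, True),
    Provider("FriendliAI", 0.90, 128.0, 1.04, True),
    Provider("SiliconFlow", 0.90, 50.0, 4.47, False),
]

def pick_provider(needs_json: bool, min_tps: float = 0.0,
                  max_ttft_s: float = float("inf")):
    """Cheapest provider meeting the constraints, or None if none qualifies."""
    eligible = [
        p for p in PROVIDERS
        if (p.json_mode or not needs_json)
        and p.output_tps >= min_tps
        and p.ttft_s <= max_ttft_s
    ]
    return min(eligible, key=lambda p: p.blended_usd_per_1m, default=None)

print(pick_provider(needs_json=True).name)
print(pick_provider(needs_json=True, min_tps=100).name)
```

With only the JSON requirement, the filter selects DeepInfra (FP8); adding a 100 t/s floor shifts the pick to Wafer. Filtering on hard requirements first and then minimizing price keeps a low headline price from masking a missing feature.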

5. Geographical Routing and Real-World Latency

Benchmark TTFT figures reflect median performance under standardized conditions. Real-world latency varies based on proximity to provider infrastructure. For latency-sensitive applications, it is worth running your own TTFT measurements from your actual deployment region before committing to a provider. The LLM API Provider KPIs guide covers how to interpret these metrics for production decisions.
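
A minimal harness for taking your own median (P50) measurement might look like the following. The `wait_for_first_token` callable is a hypothetical stand-in: in practice it would send a streaming request with your own client and return as soon as the first token arrives. Here it is replaced with a short sleep so the sketch runs offline.

```python
import statistics
import time
from typing import Callable

def measure_p50_ttft(wait_for_first_token: Callable[[], None],
                     runs: int = 20) -> float:
    """Median wall-clock seconds for `wait_for_first_token` to return."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        wait_for_first_token()  # hypothetical: returns at first streamed token
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Stand-in for a real client call so the sketch runs offline.
def fake_request() -> None:
    time.sleep(0.01)

print(f"P50 TTFT: {measure_p50_ttft(fake_request, runs=5):.3f}s")
```

Run it from the region where your application is deployed, with prompts shaped like your real traffic, since both affect the result.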

Conclusion

GLM-5.1 is a strong open-weight choice for agentic engineering and long-horizon coding — but provider selection determines whether its cost and performance profile actually hold in production. With a 2.3x pricing spread and a 5.2x output speed spread across 10 providers, the choice is not cosmetic.

DeepInfra leads across every cost metric and ties for fastest TTFT, making it the strongest starting point for most workloads. Fireworks is the right choice when throughput is the primary constraint. Wafer offers the clearest balanced alternative. SiliconFlow’s missing JSON mode support is a practical blocker for structured output pipelines despite its competitive blended price.

For teams evaluating GLM-5.1 alongside other models in the same family, GLM-5 and GLM-4.6 are both available on DeepInfra. The full text generation model catalog covers the broader open-weight landscape if you want to compare options before committing. Visit deepinfra.com/zai-org/GLM-5.1 to get started.
