Z.ai’s GLM-5.1 is an April 2026 open-weight reasoning model built for long-horizon agentic engineering — and accessing it effectively means navigating a real spread of provider options. Across 10 benchmarked API providers, blended pricing ranges from $0.74 to $1.70 per 1M tokens, output speed from 33.8 to 175.2 t/s, and the fastest provider is 5.2x quicker than the slowest. For teams moving from prototype to production, those differences determine whether this model fits a budget or quietly breaks a latency target. This breakdown covers the full provider landscape — performance metrics, pricing structures, and how to match them to your workload.
GLM-5.1 — Best APIs
| Provider | Why it’s a best pick | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Speed (t/s) | TTFT (s) | Context | JSON | Func |
|---|---|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | Best cost + top-tier latency; strong for scale and budget-sensitive workloads | 0.74 | 1.05 | 3.50 | 35 | 0.94 | 203k | Yes | Yes |
| Fireworks | Best raw performance; fastest output speed | 0.90 | not listed | not listed | 175.2 | 0.94 | 203k | Yes | Yes |
| Wafer | Strong speed + low blended price | 0.86 | not listed | not listed | 160.4 | 1.11 | 203k | Yes | Yes |
| FriendliAI | Balanced speed + low latency + competitive blended price | 0.90 | not listed | not listed | 128 | 1.04 | 203k | Yes | Yes |
| SiliconFlow | Low blended price; note: no JSON mode | 0.90 | not listed | not listed | 50 | 4.47 | 205k | No | Yes |
GLM-5.1 is Z.ai’s next-generation flagship model for agentic engineering, released on April 7, 2026. It is a post-training refinement of GLM-5, specifically optimized for coding and long-horizon autonomous workflows. The model uses a 754-billion parameter Mixture-of-Experts (MoE) architecture with 40 billion active parameters per token, a 203K context window, and up to 131K output tokens. Weights are available on Hugging Face under the MIT license.
On SWE-Bench Pro, GLM-5.1 scores 58.4, ahead of GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). On AIME 2026, it scores 95.3. The model’s core design principle is sustained improvement across long agentic runs — unlike predecessor models that plateau after initial gains, GLM-5.1 is built to keep improving across hundreds of rounds and thousands of tool calls. For a deeper look at its predecessor, see the GLM-5 API benchmarks.
Z.ai demonstrated this capability by having GLM-5.1 build a complete Linux desktop environment autonomously over 8 hours, running 655 iterations and increasing vector database query throughput to 6.9x the initial baseline. GLM-5.1 is available on DeepInfra at deepinfra.com/zai-org/GLM-5.1.
Based on benchmarks across 10 tracked providers, DeepInfra is the recommended API for production GLM-5.1 deployment. It offers the lowest blended price ($0.74/1M), lowest input price ($1.05/1M), lowest output price ($3.50/1M), and ties Fireworks for fastest TTFT at 0.94s. For applications requiring maximum raw throughput, Fireworks leads at 175.2 t/s. For a balanced alternative at slightly lower cost than Fireworks, Wafer ($0.86/1M blended, 160.4 t/s) is the strongest option.
DeepInfra offers the best overall value for GLM-5.1 — lowest cost across every token metric, top-tier latency, and full feature support.
With a 2.3x pricing spread across the benchmark set, provider choice is a real economic decision for GLM-5.1. DeepInfra’s FP8 deployment delivers cost and latency leadership simultaneously, the strongest combination in the set. For the long-context, agentic workloads GLM-5.1 is designed for, DeepInfra’s cached input price of $0.205/1M is also the only explicitly listed cache rate among benchmarked providers, making it the most practical option for agent loops, RAG pipelines, and any workload that resends stable prompt prefixes. For context on how the GLM family compares against DeepSeek V3.2 on cost and capability, see the GLM-4.6 vs DeepSeek-V3.2 breakdown.
1. Fireworks — Best for Raw Throughput
Fireworks leads all 10 providers on raw output speed at 175.2 t/s, a 5.2x advantage over the slowest provider in the set (33.8 t/s). It also ties DeepInfra on TTFT at 0.94s and shares the lead on time to first answer token at 22.58s. The trade-off is cost: at $0.90/1M blended, it is 22% more expensive than DeepInfra. For throughput-critical applications or any workload where generation speed directly affects user experience, Fireworks is the right choice.
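The ratios quoted in this comparison are simple arithmetic, and it can help to recompute them when plugging in your own quotes. The sketch below uses only figures stated in this article; the function and constant names are illustrative.

```python
# Figures quoted in this article (benchmark set of 10 providers).
CHEAPEST_BLENDED = 0.74   # $/1M tokens, DeepInfra
PRICIEST_BLENDED = 1.70   # $/1M tokens, most expensive provider in the set
FIREWORKS_BLENDED = 0.90  # $/1M tokens
FASTEST_TPS = 175.2       # tokens/s, Fireworks
SLOWEST_TPS = 33.8        # tokens/s, Parasail


def ratio(high: float, low: float) -> float:
    """How many times larger `high` is than `low`."""
    return high / low


def premium_pct(price: float, baseline: float) -> float:
    """Percentage by which `price` exceeds `baseline`."""
    return (price / baseline - 1.0) * 100.0


price_spread = ratio(PRICIEST_BLENDED, CHEAPEST_BLENDED)
speed_spread = ratio(FASTEST_TPS, SLOWEST_TPS)
fireworks_premium = premium_pct(FIREWORKS_BLENDED, CHEAPEST_BLENDED)

print(f"price spread: {price_spread:.1f}x")    # prints "price spread: 2.3x"
print(f"speed spread: {speed_spread:.1f}x")    # prints "speed spread: 5.2x"
print(f"Fireworks premium over DeepInfra: {fireworks_premium:.0f}%")  # prints "...: 22%"
```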
2. Wafer — Best Balanced Alternative
Wafer is the strongest all-around alternative to DeepInfra, ranking #2 on blended price, #2 on output speed (160.4 t/s), and #2 on time to first answer token (24.74s). For teams that want strong performance without paying the Fireworks speed premium or accepting DeepInfra’s lower output throughput, Wafer occupies the clearest middle ground in the benchmark set.
3. FriendliAI — Balanced Across All Metrics
FriendliAI offers a consistent performance profile across all metrics: no single area stands out, but there are no obvious weaknesses either. At $0.90/1M blended with 128 t/s output and 1.04s TTFT, it is a reliable backup provider for intelligent routing setups that need a consistent mid-tier option.
4. SiliconFlow — Low Price, Higher Latency, No JSON Mode
SiliconFlow is the only provider in the benchmark set without JSON mode support. For structured output pipelines or agentic workflows that rely on reliable JSON responses, this creates downstream friction — retries, prompt padding, and format-correction logic that inflates real token counts. Its 4.47s TTFT is also the highest in the set by a significant margin. The competitive blended price needs to be weighed against these operational constraints before committing it to production.
5. Remaining Providers
Novita, Parasail (FP8), Together.ai, Nebius (FP8 Base), and CoreWeave round out the provider set. Together.ai and Nebius both list $1.40/1M input and $4.40/1M output, the most expensive input/output pricing in the benchmark. Parasail has the slowest measured output speed at 33.8 t/s. All five support function calling and, since SiliconFlow is the only provider in the set without it, JSON mode. For teams building intelligent routing, these providers can serve as fallbacks, but none presents a clear cost or performance advantage over the five providers covered above.
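For teams wiring up routing with fallbacks, the provider comparison can be encoded as data and filtered by workload constraints. The following is a minimal sketch, not a production router: the `Provider` class and `rank_providers` function are hypothetical names, and the values are copied from the comparison table in this article, so they should be refreshed from current benchmarks before use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    blended_usd_per_m: float  # blended $/1M tokens
    speed_tps: float          # output tokens per second
    ttft_s: float             # time to first token, seconds
    json_mode: bool


# Values copied from the comparison table in this article.
PROVIDERS = [
    Provider("DeepInfra (FP8)", 0.74, 35.0, 0.94, True),
    Provider("Fireworks", 0.90, 175.0, 0.94, True),
    Provider("Wafer", 0.86, 160.4, 1.11, True),
    Provider("FriendliAI", 0.90, 128.0, 1.04, True),
    Provider("SiliconFlow", 0.90, 50.0, 4.47, False),
]


def rank_providers(providers, *, need_json=False, min_tps=0.0,
                   max_ttft_s=float("inf")):
    """Return providers that meet the constraints, cheapest first.

    Ties on price are broken by higher output speed.
    """
    eligible = [
        p for p in providers
        if (p.json_mode or not need_json)
        and p.speed_tps >= min_tps
        and p.ttft_s <= max_ttft_s
    ]
    return sorted(eligible, key=lambda p: (p.blended_usd_per_m, -p.speed_tps))


# Example: structured-output workload that needs at least 100 t/s.
for p in rank_providers(PROVIDERS, need_json=True, min_tps=100):
    print(p.name, p.blended_usd_per_m)
```

The first entry of the returned list is the primary choice and the rest form an ordered fallback chain under the same constraints.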
1. Blended Pricing Uses a 7:2:1 Cache-Input-Output Ratio
Artificial Analysis benchmarks GLM-5.1 using a 7:2:1 cache-input-output blended ratio, a reminder that cached input dominates the token mix in most production agentic applications rather than being an edge case. For teams building with GLM-5.1’s long-horizon design in mind (repeated tool schemas, stable system prompts, persistent agent state), the cache rate matters more than the headline blended figure. DeepInfra is the only provider in this set with an explicitly listed cached input rate ($0.205/1M), which maps directly to cost savings on those patterns. For a practical breakdown of how this plays out across real workloads, see the GLM-5.1 pricing guide.
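To see how much the cache share moves the effective price, the blended figure can be approximated as a weighted average over the token mix. This is a sketch under a stated assumption: it treats the 7:2:1 ratio as plain token-count weights, which may not be the exact formula the benchmark uses. With the DeepInfra rates quoted here it yields about $0.70/1M, close to but not identical to the listed $0.74. The function name `blended_price` is illustrative.

```python
def blended_price(cached: float, inp: float, out: float, ratio=(7, 2, 1)) -> float:
    """Weighted-average $/1M tokens for a cache:input:output token mix.

    Assumption: the ratio is applied as simple token-count weights.
    """
    w_cache, w_in, w_out = ratio
    total = w_cache + w_in + w_out
    return (w_cache * cached + w_in * inp + w_out * out) / total


# DeepInfra rates quoted in this article ($/1M tokens).
deepinfra = dict(cached=0.205, inp=1.05, out=3.50)

benchmark_mix = blended_price(**deepinfra)                  # 7:2:1 mix
no_cache_mix = blended_price(**deepinfra, ratio=(0, 9, 1))  # same output share, nothing cached

print(f"7:2:1 mix:    ${benchmark_mix:.4f} per 1M tokens")
print(f"no cache hit: ${no_cache_mix:.4f} per 1M tokens")
```

Swapping in your own measured cache hit ratio gives a more realistic estimate than the headline blended number.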
2. TTFT vs. Time to First Answer Token
GLM-5.1 is a reasoning model with thinking mode enabled by default. TTFT measures time to the first token (often a thinking/reasoning token), while time to first answer token measures when the model begins generating the actual response. For user-facing applications, the latter is the number that matters. Fireworks leads on time to first answer token at 22.58s, ahead of Wafer, which ranks #2 at 24.74s. When evaluating latency for interactive use cases, make sure you are measuring the right metric.
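The two metrics can be separated when timing a streamed response. The sketch below is client-agnostic and makes an assumption about the stream shape: each chunk is a dict with optional `"reasoning"` and `"content"` strings. Those field names are hypothetical, so map them to whatever your client library actually returns. The `clock` parameter exists only so the function can be tested deterministically.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_first_tokens(
    chunks: Iterable[dict],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (ttft, time_to_first_answer_token) in seconds.

    Timing starts when this function is called, so `chunks` should be a
    lazy stream that issues the request on first iteration.
    Assumed chunk shape (hypothetical): {"reasoning": str, "content": str}.
    """
    start = clock()
    ttft = None
    first_answer = None
    for chunk in chunks:
        now = clock()
        has_reasoning = bool(chunk.get("reasoning"))
        has_content = bool(chunk.get("content"))
        # TTFT: first token of any kind, thinking or answer.
        if ttft is None and (has_reasoning or has_content):
            ttft = now - start
        # Time to first answer token: first non-thinking token.
        if has_content:
            first_answer = now - start
            break
    return ttft, first_answer
```

If the stream ends before any answer content arrives, the second value is `None`, which is worth logging separately from slow responses.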
3. FP8 Quantization and What It Means
DeepInfra serves GLM-5.1 in FP8 quantization. The benchmarked pricing reflects FP8 serving; the model page also lists FP4 pricing. Confirm which serving tier you are buying before locking in cost assumptions. FP8 reduces memory requirements and inference cost with minimal impact on output quality for most production workloads — but for edge-case mathematical reasoning or complex coding tasks, it is worth running evals against your specific prompt distribution.
4. SiliconFlow JSON Mode Gap
All 10 providers support function calling, but SiliconFlow is the only one in the set without JSON mode. For agentic pipelines that rely on structured outputs — tool call responses, retrieval schemas, or any workflow that parses model output programmatically — the absence of JSON mode creates real operational friction. Check your structured output requirements before routing to SiliconFlow.
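Without JSON mode, the parsing burden moves to the client. The following is a minimal sketch of the kind of format-correction logic that becomes necessary: it tries the raw text, then a fenced block, then the outermost brace pair, and returns `None` so the caller can decide whether to retry. The function name `extract_json` is illustrative, and this is a best-effort heuristic rather than a replacement for provider-side JSON mode.

```python
import json
import re

# Matches a fenced block (three backticks, optional "json" tag).
_FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def extract_json(text: str):
    """Best-effort parse of a JSON value from free-form model output.

    Returns the parsed value, or None if nothing parseable is found.
    """
    candidates = [text]
    fenced = _FENCE.search(text)
    if fenced:
        candidates.append(fenced.group(1))
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])
    for candidate in candidates:
        try:
            return json.loads(candidate.strip())
        except json.JSONDecodeError:
            continue
    return None


print(extract_json('Result: {"tool": "search", "args": {"q": "glm"}} done'))
```

Every `None` here typically means another model call, which is how a missing JSON mode turns into extra tokens and latency.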
5. Geographical Routing and Real-World Latency
Benchmark TTFT figures reflect median performance under standardized conditions. Real-world latency varies based on proximity to provider infrastructure. For latency-sensitive applications, it is worth running your own TTFT measurements from your actual deployment region before committing to a provider. The LLM API Provider KPIs guide covers how to interpret these metrics for production decisions.
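When collecting your own TTFT samples from a deployment region, summarizing them the same way the benchmark does (median) alongside a tail percentile makes the comparison fairer. A minimal sketch using only the Python standard library follows; the function name and the choice of p95 as the tail metric are assumptions for illustration.

```python
import statistics


def summarize_latency(samples_s):
    """Return the median and 95th percentile (seconds) of latency samples."""
    if len(samples_s) < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples_s)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=20, method="inclusive")[-1]
    return {"median": statistics.median(ordered), "p95": p95}


# Example with made-up TTFT samples in seconds (not benchmark data).
print(summarize_latency([0.91, 0.95, 1.02, 0.97, 1.40, 0.93]))
```

A median close to the published figure with a much higher p95 usually points to regional or network variance rather than provider throughput.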
GLM-5.1 is a strong open-weight choice for agentic engineering and long-horizon coding — but provider selection determines whether its cost and performance profile actually hold in production. With a 2.3x pricing spread and a 5.2x output speed spread across 10 providers, the choice is not cosmetic.
DeepInfra leads across every cost metric and ties for fastest TTFT, making it the strongest starting point for most workloads. Fireworks is the right choice when throughput is the primary constraint. Wafer offers the clearest balanced alternative. SiliconFlow’s missing JSON mode support is a practical blocker for structured output pipelines despite its competitive blended price.
For teams evaluating GLM-5.1 alongside other models in the same family, GLM-5 and GLM-4.6 are both available on DeepInfra. The full text generation model catalog covers the broader open-weight landscape if you want to compare options before committing. Visit deepinfra.com/zai-org/GLM-5.1 to get started.