Provider choice for GLM-5.1 is a real economic decision. Across 10 benchmarked API providers, blended pricing runs from $0.74 to $1.70 per 1M tokens, output speed from 33.8 to 175.2 t/s, and the fastest provider is 5.2x quicker than the slowest. For teams deploying at scale, that spread determines whether this model fits a production budget or quietly wrecks a latency target.
GLM-5.1 is an April 2026 open-weight model from Z.AI, built for long-horizon, tool-using engineering work rather than short one-shot interactions. It carries a ~203K-token context window, MIT license, JSON and function calling support, and credible benchmark results on coding and agentic tasks — making it a strong candidate for teams who want deployment flexibility alongside cost discipline. For a deeper look at its predecessor, see the GLM-5 API benchmarks.
GLM-5.1 is available across 10 benchmarked providers with blended pricing from $0.74 to $1.70 per 1M tokens. DeepInfra leads on cost, Fireworks on raw speed, Wafer on balance, and OpenRouter on managed access. The model is best suited for long-context, tool-using, engineering-heavy workflows where open weights and cost discipline both matter. Teams evaluating alternatives in the same family can compare against GLM-5 and GLM-4.6.
| Best For | Provider | Why |
|---|---|---|
| Lowest price / cost-sensitive workloads | DeepInfra (FP8 benchmarked; FP4 on model page) | Lowest blended price at $0.74/1M tokens; listed at $1.05 input and $3.50 output per 1M. |
| RAG, document-heavy, or agentic workloads | DeepInfra | Combines lowest token pricing with ~202.8K context, cached input at $0.205/1M, JSON, and function calling. |
| Proprietary or managed access | OpenRouter | Managed routing to z-ai/glm-5.1 with provider fallbacks for maximum uptime. |
| Easiest onboarding | DeepInfra | Public deployment of zai-org/GLM-5.1 with full JSON and function calling — no provider selection logic needed. |
| Lowest latency / fastest responses | Fireworks | Leads on output speed (175.2 t/s) and time to first answer token (22.58s). |
| Best balanced alternative | Wafer | #2 on blended price ($0.86), output speed (160.9 t/s), and time to first answer token (24.67s). |
| Speed-focused backup | FriendliAI | #3 on output speed (128.2 t/s) and answer latency (30.62s) at a competitive $0.90 blended price. |
Token pricing is where a model that looks cheap on paper gets expensive in production. GLM-5.1 is built for long sessions, large prompts, and tool use — all of which consume tokens aggressively.
A token is a chunk of text, not a word. Prompts, tool schemas, retrieved documents, chain state, and model output all count toward your bill. For long-context agent workflows, the expensive part is often not the final answer — it is the repeated accumulation of context.
Artificial Analysis uses a 7:2:1 cache-input-output ratio in its blended benchmark price, a reminder that cache behavior is the workload, not an edge case.
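To make the 7:2:1 weighting concrete, here is a minimal sketch of a blended price. It assumes the blend is a plain weighted average of the cached, fresh-input, and output rates; the exact formula Artificial Analysis applies is not given here, and the `blended_price` helper is hypothetical.

```python
def blended_price(cached: float, fresh_input: float, output: float,
                  mix: tuple[int, int, int] = (7, 2, 1)) -> float:
    """Weighted-average $/1M tokens for a cache:input:output token mix.

    Assumption: a simple weighted mean. The benchmark's own method may differ.
    """
    cache_w, input_w, output_w = mix
    weighted = cache_w * cached + input_w * fresh_input + output_w * output
    return weighted / (cache_w + input_w + output_w)


# DeepInfra rates as listed in this post ($ per 1M tokens).
deepinfra_listed = blended_price(cached=0.205, fresh_input=1.05, output=3.50)

# The same rates under a mix with no cache hits at all (0:2:1).
deepinfra_no_cache = blended_price(0.205, 1.05, 3.50, mix=(0, 2, 1))

print(f"7:2:1 mix: ${deepinfra_listed:.4f} per 1M tokens")
print(f"0:2:1 mix: ${deepinfra_no_cache:.4f} per 1M tokens")
```

With the listed DeepInfra rates this sketch lands near $0.70 rather than the $0.74 benchmark figure, so read it as an illustration of how much the cache share moves the blend, not as a reproduction of the benchmark number.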
| Token type | What it is | Why it matters |
|---|---|---|
| Input tokens | Tokens you send to the model in the request | Your prompt cost. Includes user input, system prompts, tool definitions, retrieved context, and prior conversation state. |
| Output tokens | Tokens the model generates in response | Usually the most expensive token class per token. Long answers, code generation, and verbose tool reasoning push this up fast. |
| Cached input tokens | Previously seen input tokens billed at a discounted rate | Matters for chat loops, agents, and RAG systems that resend large prompt prefixes. Can materially reduce costs. |
Different GLM-5.1 providers favor different workload shapes — input-heavy, output-heavy, or cache-friendly. Choosing on blended price alone can mislead if your traffic mix is unusual.
| Provider | Token cost profile | Advantages | Disadvantages |
|---|---|---|---|
| DeepInfra | $1.05 input / $3.50 output / $0.205 cached per 1M. Lowest blended at $0.74/1M. | Best for cost-sensitive production. Cached input pricing is the standout lever for RAG, agent loops, and long sessions. | Benchmarked lowest-cost result references FP8; model page lists FP4. Confirm exact serving tier before locking in cost assumptions. |
| OpenRouter | $0.98 input / $3.08 output per 1M. No cache pricing listed. | Lower listed input/output rates than DeepInfra’s model page. Useful for routed access and provider fallback. | No published cache pricing, so repeated-prefix workloads are harder to model. Less direct control over the underlying provider path. |
| Wafer | Blended $0.86/1M (Artificial Analysis). No input/output breakout. | Good price/speed balance — #2 on both blended cost and output speed. | No separate input/output/cache rates available. Hard to model for unusual token mixes. |
| FriendliAI | Blended $0.90/1M (Artificial Analysis). No token-type breakout. | Low enough to stay practical; #3 on output speed and answer latency. | Same visibility gap as Wafer — blended price alone can mislead on asymmetric workloads. |
| SiliconFlow | Blended $0.90/1M. No token-type breakout. | Competitive on blended cost. | No JSON mode (the only provider in the set without it). Without structured output, you need retry logic and prompt padding, both of which inflate real token counts. |
| Novita | Blended $0.90/1M. No token-type breakout. | Sits in the low-cost group. | No detailed input/output/cache pricing available. Harder to budget for prompt-heavy or output-heavy workloads. |
| Fireworks | Not lowest-cost tier on blended pricing. | Best provider when latency is the primary constraint — 175.2 t/s output speed. | You are paying for speed. For large-scale async workloads, the speed premium may not justify the token bill. |
| Together.ai | $1.40 input / $4.40 output per 1M. | Straightforward published token pricing. | More expensive than DeepInfra on both sides. Gap compounds on prompt-heavy or code-heavy workloads. |
| Nebius | $1.40 input / $4.40 output per 1M. Highest blended at $1.70/1M. | No cost advantage. | Most expensive provider in the set. Premium adds up fast on long-context or agentic workloads. |
| Parasail | No token-type breakout. | No cost advantage called out in the benchmark data. | Slowest measured output speed in the set (33.8 t/s). Token rates are on the low side, but the experience for interactive use is worse. |
Practical rule of thumb
The biggest pricing trap with GLM-5.1: teams focus on the per-1M headline, build an agent that resends huge prompt prefixes and emits long tool traces, and then act surprised when “cheap” becomes a line item. Model at least three workload shapes (input-heavy, output-heavy, and cache-friendly) before committing to a provider.
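One way to follow that rule is a small comparison script. The sketch below uses only the per-token rates quoted in this post; the three workload shapes are illustrative token volumes in millions, and the `Rates` class is hypothetical. Where a provider lists no cached rate, it assumes cached tokens are billed at the full input rate, which may not match actual billing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rates:
    input: float                    # $ per 1M fresh input tokens
    output: float                   # $ per 1M output tokens
    cached: Optional[float] = None  # $ per 1M cached input tokens, if listed

    def cost(self, fresh_m: float, cached_m: float, output_m: float) -> float:
        """Monthly cost in dollars for token volumes given in millions."""
        # Assumption: no listed cache rate means cached tokens pay full price.
        cached_rate = self.cached if self.cached is not None else self.input
        total = fresh_m * self.input + cached_m * cached_rate + output_m * self.output
        return round(total, 2)


PROVIDERS = {
    "DeepInfra": Rates(input=1.05, output=3.50, cached=0.205),
    "OpenRouter": Rates(input=0.98, output=3.08),
    "Together.ai": Rates(input=1.40, output=4.40),
}

# (fresh input, cached input, output) in millions of tokens per month.
SHAPES = {
    "input-heavy": (500, 0, 50),
    "output-heavy": (125, 0, 75),
    "cache-friendly": (100, 300, 20),
}

costs = {
    name: {shape: rates.cost(*volumes) for shape, volumes in SHAPES.items()}
    for name, rates in PROVIDERS.items()
}

for name, by_shape in costs.items():
    print(name, by_shape)
```

Under these assumptions the ranking changes with the shape: the lower listed OpenRouter rates win when nothing is cached, while the cached rate puts DeepInfra well ahead on the cache-friendly shape.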
DeepInfra runs on bare-metal infrastructure, typically 50–80% cheaper than major cloud competitors, and is the only provider in this benchmark set with explicit cached input pricing for GLM-5.1. For developers building long-session, tool-using, or agentic workloads, that combination of low token cost and cache-aware economics is the clearest cost lever available.
| Model | Best Use Case | Context Window | Input ($/1M) | Output ($/1M) |
|---|---|---|---|---|
| GLM-5.1 | Long-horizon agentic engineering and tool-using workflows | 202,752 tokens | $1.05 | $3.50 |
At $1.05 input / $3.50 output per 1M tokens and a $0.74/1M blended benchmark price under Artificial Analysis’s 7:2:1 cache-input-output mix, DeepInfra gives you more room to scale before token spend becomes the bottleneck. Browse the full text generation model catalog to see how GLM-5.1 compares against other options for your workload.
The scenarios below reflect workloads where DeepInfra’s GLM-5.1 pricing is easiest to justify: long prompts, repeated context, tool-heavy loops, and engineering workflows that keep state across turns.
Scenario 1: Repo-aware coding assistant
A coding assistant that ingests repo context, tool schemas, and prior conversation state on every turn. This is exactly the workload GLM-5.1’s long-context design is built for, and where DeepInfra’s low input pricing keeps monthly spend predictable.
| Metric | Value |
|---|---|
| Volume | 10,000 requests/month |
| Model | GLM-5.1 |
| Provider | DeepInfra |
| Input Tokens | 200,000,000 |
| Output Tokens | 40,000,000 |
| Monthly Cost | $350.00 |
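Cost breakdown: 200M input tokens × $1.05/1M = $210.00, plus 40M output tokens × $3.50/1M = $140.00, for a total of $350.00.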
Comparison: The same workload on Together.ai would cost $456.00 — $106.00 more per month.
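The monthly total can also be read per request, which is often easier to reason about when sizing prompts. This is a rough sketch that spreads the Scenario 1 token volumes evenly across requests (real traffic will vary) and uses the DeepInfra and Together.ai rates quoted in this post; `per_request_cost` is a hypothetical helper.

```python
# Scenario 1 volumes from the table above.
REQUESTS = 10_000
INPUT_TOKENS = 200_000_000
OUTPUT_TOKENS = 40_000_000


def per_request_cost(input_rate: float, output_rate: float) -> float:
    """Average dollars per request, given $/1M-token rates."""
    input_per_request = INPUT_TOKENS / REQUESTS    # 20,000 tokens
    output_per_request = OUTPUT_TOKENS / REQUESTS  # 4,000 tokens
    return (input_per_request * input_rate
            + output_per_request * output_rate) / 1_000_000


deepinfra = per_request_cost(1.05, 3.50)
together = per_request_cost(1.40, 4.40)

print(f"DeepInfra:   ${deepinfra:.4f}/request, ${deepinfra * REQUESTS:,.2f}/month")
print(f"Together.ai: ${together:.4f}/request, ${together * REQUESTS:,.2f}/month")
```

At roughly 20,000 input tokens per request, trimming the resent context is the most direct way to lower the per-request figure on either provider.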
Scenario 2: Long-document RAG assistant
Input-heavy workloads where policy docs, runbooks, incident history, and retrieved passages dominate the token mix. DeepInfra is strong here because the input side is cheap and the context window is large enough to reduce aggressive chunking. For guidance on building document-processing pipelines, see open vs. closed source model tradeoffs.
| Metric | Value |
|---|---|
| Volume | 50,000 requests/month |
| Model | GLM-5.1 |
| Provider | DeepInfra |
| Input Tokens | 500,000,000 |
| Output Tokens | 50,000,000 |
| Monthly Cost | $700.00 |
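Cost breakdown: 500M input tokens × $1.05/1M = $525.00, plus 50M output tokens × $3.50/1M = $175.00, for a total of $700.00.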
Comparison: The same workload on Nebius would cost $920.00 — $220.00 more per month.
Scenario 3: Agent loop with stable prompt prefixes
This is where DeepInfra gets especially compelling. If your agent resends the same system prompt, repo map, tool definitions, and workflow instructions on every turn, cached input pricing becomes a structural cost lever. This is exactly the pattern GLM-5.1 is designed for.
| Metric | Value |
|---|---|
| Volume | 100,000 agent turns/month |
| Model | GLM-5.1 |
| Provider | DeepInfra |
| Input Tokens | 100M fresh + 300M cached |
| Output Tokens | 20,000,000 |
| Monthly Cost | $236.50 |
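Cost breakdown: 100M fresh input tokens × $1.05/1M = $105.00, plus 300M cached input tokens × $0.205/1M = $61.50, plus 20M output tokens × $3.50/1M = $70.00, for a total of $236.50.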
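Comparison: No cached-input rate is listed for Nebius here, so if all 400M input tokens were billed at its $1.40/1M input rate and the 20M output tokens at $4.40/1M, the same workload would cost $648.00, or $411.50 more per month. DeepInfra at $0.74/1M blended is the lowest in the benchmark set under the 7:2:1 cache-input-output mix Artificial Analysis uses.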
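To see how much of that total depends on the cache, the sketch below recomputes Scenario 3 at different cache hit rates using the DeepInfra rates quoted in this post. The hit rate is treated as a single fraction of all input tokens, which is a simplification, and `monthly_cost` is a hypothetical helper.

```python
INPUT_RATE = 1.05    # $ per 1M fresh input tokens
CACHED_RATE = 0.205  # $ per 1M cached input tokens
OUTPUT_RATE = 3.50   # $ per 1M output tokens


def monthly_cost(total_input_m: float, output_m: float, hit_rate: float) -> float:
    """Dollars per month; volumes in millions, hit_rate between 0 and 1."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached_m = total_input_m * hit_rate
    fresh_m = total_input_m - cached_m
    return fresh_m * INPUT_RATE + cached_m * CACHED_RATE + output_m * OUTPUT_RATE


# Scenario 3: 400M input tokens in total, 300M of them cached (75% hit rate).
with_cache = monthly_cost(400, 20, hit_rate=0.75)
without_cache = monthly_cost(400, 20, hit_rate=0.0)

for rate in (0.0, 0.25, 0.5, 0.75):
    print(f"hit rate {rate:.0%}: ${monthly_cost(400, 20, rate):,.2f}")
print(f"saved by caching at 75%: ${without_cache - with_cache:,.2f}")
```

Keeping the system prompt, tool definitions, and repo map byte-for-byte stable at the front of the prompt is what makes a high hit rate plausible; the sketch only prices the result.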
Scenario 4: Batch code generation and refactoring
Asynchronous overnight jobs — generating tests, migrating code, writing adapters, refactoring templates. Speed matters less than price discipline here. This is one of the cleanest cases for DeepInfra over a faster but less cost-efficient option.
| Metric | Value |
|---|---|
| Volume | 25,000 jobs/month |
| Model | GLM-5.1 |
| Provider | DeepInfra |
| Input Tokens | 125,000,000 |
| Output Tokens | 75,000,000 |
| Monthly Cost | $393.75 |
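Cost breakdown: 125M input tokens × $1.05/1M = $131.25, plus 75M output tokens × $3.50/1M = $262.50, for a total of $393.75.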
Comparison: The same workload on Together.ai would cost $505.00 — $111.25 more per month.
Scenario 5: Production API at scale
At production scale, small per-token differences compound fast. DeepInfra’s blended price lead becomes operationally meaningful.
| Metric | Value |
|---|---|
| Volume | 1,000,000,000 total tokens/month |
| Model | GLM-5.1 |
| Provider | DeepInfra |
| Token Mix | 7:2:1 cache-input-output |
| Monthly Cost | $740.00 blended benchmark equivalent |
Cost basis: 1,000M total tokens × $0.74/1M blended = $740.00.
Comparison: Same volume on Wafer = $860.00. On Nebius = $1,700.00.
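The same comparison extends to every provider with a blended price quoted in this post. This is a minimal sketch that assumes your traffic matches the 7:2:1 benchmark mix, which is the only condition under which a blended price applies directly; `BLENDED` and `premium_over_cheapest` are illustrative names.

```python
# Blended $/1M tokens as quoted in this post.
BLENDED = {
    "DeepInfra": 0.74,
    "Wafer": 0.86,
    "FriendliAI": 0.90,
    "SiliconFlow": 0.90,
    "Novita": 0.90,
    "Nebius": 1.70,
}


def premium_over_cheapest(total_tokens_m: float) -> dict[str, tuple[float, float]]:
    """Map provider -> (monthly cost, extra cost vs. the cheapest provider)."""
    monthly = {name: round(price * total_tokens_m, 2) for name, price in BLENDED.items()}
    cheapest = min(monthly.values())
    return {name: (cost, round(cost - cheapest, 2)) for name, cost in monthly.items()}


report = premium_over_cheapest(1_000)  # 1,000M tokens = 1B tokens per month

for name, (cost, extra) in sorted(report.items(), key=lambda item: item[1][0]):
    print(f"{name:<12} ${cost:>9,.2f}  (+${extra:,.2f})")
```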
For context on how other models in the GLM family perform at this scale, the GLM-4.6 vs DeepSeek-V3.2 comparison breaks down cost and performance tradeoffs across providers.
Choosing a provider for GLM-5.1 comes down to three things: input cost, cache pricing, and latency fit. The spread across 10 providers is wide enough that a naive choice costs real money, and the model’s design for long-horizon tool-using work means the token patterns that hurt most — large inputs, repeated prefixes, verbose tool traces — are exactly where provider differences compound fastest.
For most agentic and engineering workloads, DeepInfra is the strongest starting point: the lowest blended price, low listed input pricing, explicit cached input pricing at $0.205/1M, full JSON and function calling support, and private endpoint deployment for teams that need it. For latency-sensitive interactive applications, Fireworks leads. For managed routing with fallbacks, OpenRouter is the practical choice.
Before building, check the GLM-5.1 API reference on DeepInfra to confirm supported parameters, then model your actual token costs — input-heavy, output-heavy, or cache-friendly — against the pricing tiers covered here. The GLM-5 benchmarks post is also useful context if you are deciding between generations.
© 2026 DeepInfra. All rights reserved.