GLM-4.6 is a high-capacity, “reasoning”-tuned model that shows up in coding copilots, long-context RAG, and multi-tool agent loops. For this class of workload, provider infrastructure determines perceived speed (first-token time), tail stability, and your unit economics. According to ArtificialAnalysis (AA) provider charts for GLM-4.6 (Reasoning), DeepInfra (FP8) pairs a sub-second Time-to-First-Token (TTFT) of 0.51 s with the lowest output price ($1.9/M) in the cohort, while keeping competitive throughput at 100k-token prompts. Baseten wins a few raw-speed trophies (fastest TTFT and the highest t/s), but charges more per output token.
GLM-4.6 is Zhipu’s latest general-purpose reasoning model. AA evaluates it with input sizes up to 100k tokens, which makes each provider’s long-context behavior particularly relevant for RAG over large docs/repos. The model slots well into coding copilots, long-context RAG pipelines, and multi-tool agent loops.
What GLM-4.6 doesn’t provide by itself: browsing, code execution, or file I/O—those depend on your orchestration layer. In sizing, pick it when you need stronger reasoning and very long contexts; otherwise, a cheaper/smaller model may be sufficient for bulk chat or classification.
Compared to its predecessor GLM-4.5, the newer version brings several key improvements:
TTFT is the elapsed time from request to the first streamed token. For chat UIs, IDEs, and step-wise agents, it’s the metric users feel.
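If you want to sanity-check this number on your own requests, here is a minimal sketch that times the first streamed chunk from an OpenAI-compatible chat endpoint. The base URL and model id below are assumptions for illustration (substitute your provider’s documented values), not figures taken from the AA charts.

```python
# Minimal TTFT probe against an OpenAI-compatible streaming endpoint.
# Assumptions: the base URL and model id are illustrative placeholders;
# check your provider's docs for the exact values.
import os
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",  # assumed OpenAI-compatible base URL
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="zai-org/GLM-4.6",  # assumed model id; adjust to the provider's listing
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
    stream=True,
)

for chunk in stream:
    # The first chunk that carries content marks the time-to-first-token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.2f} s")
        break
```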
API ranking & deltas vs DeepInfra (0.51 s):
DeepInfra sits in the sub-second group and is second-fastest here, within ~120 ms of the leader. Baseten is the right pick when you absolutely need the earliest possible first token. Everyone else pays a noticeable TTFT tax relative to DeepInfra.
Predictability is about how steady the stream is, not just how fast it starts. The variance plot for output speed (tokens/sec) illustrates this directly: the median indicates typical throughput, the IQR (box) captures routine jitter, and the whiskers/tails highlight rare slowdowns that exceed your SLOs. Volatility in tokens per second inflates p95/p99 latency even when TTFT is fine—users feel that as pauses, tool timeouts, or retries.
How to read it:
Pick providers that combine high median t/s with compact spread; they’ll miss latency targets less often and keep agents moving consistently.
Baseten clearly leads on median tokens/sec, but DeepInfra and Novita form a stable middle that’s ~30–35% faster than Parasail’s median and ~14% faster than GMI’s. In practice, that means fewer “stalls” during streaming and tighter loop times in agent chains.
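To run the same read on your own traffic, a minimal sketch is to summarize logged per-request throughput the way the variance plot does; the samples below are illustrative, not AA data.

```python
# Summarize streaming-throughput samples (tokens/sec) the same way the
# variance plot does: median for typical speed, IQR for routine jitter,
# p95/p99 for the tail that users actually feel.
import statistics

def throughput_summary(samples_tps: list[float]) -> dict[str, float]:
    q = statistics.quantiles(samples_tps, n=100)  # q[i] is the (i+1)th percentile
    return {
        "median": statistics.median(samples_tps),
        "iqr": q[74] - q[24],   # p75 - p25, the "box" in the plot
        "p95": q[94],
        "p99": q[98],
    }

# Illustrative samples only.
print(throughput_summary([52.1, 49.8, 51.3, 47.0, 50.6, 35.2, 50.9, 48.4]))
```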
AA reports output speed (t/s) across increasing input sizes. For long-context RAG and repo analysis, the 100k column is the differentiator: this chart tells you how much performance declines as the input context grows.
What it means:
If your product routinely sends 100k-token prompts (large PDFs, monorepo sweeps), Baseten’s raw throughput minimizes streaming time; DeepInfra, however, combines fast TTFT with middle-to-high throughput and the lowest price—often the better speed-per-dollar choice.
Two providers separated by $0.3/M may look close, but at scale (hundreds of millions of tokens) it’s real money. And when you pair a cheaper provider with faster TTFT, you enable tighter agent loops—users send more prompts per minute without feeling lag.
AA shows identical input price for all five providers and different output prices:
Bottom line: DeepInfra is cheapest on output at $1.9/M—a 5–14% advantage over peers—while inputs are a wash at $0.6/M.
Use the asymmetric formula:
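A minimal form, assuming straight per-token billing at the listed rates:

cost_per_call = (input_tokens × input $/M + output_tokens × output $/M) ÷ 1,000,000

Since inputs are priced identically across this cohort, the output rate is the only term that moves the total.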
Then you get these prices for the different providers:
| Provider | Output $/M | Total cost for 3k in / 1k out |
|---|---|---|
| DeepInfra | 1.9 | $0.00370 |
| GMI | 2.0 | $0.00380 (+2.7% vs DeepInfra) |
| Parasail (FP8) | 2.1 | $0.00390 (+5.4%) |
| Novita | 2.2 | $0.00400 (+8.1%) |
| Baseten | 2.2 | $0.00400 (+8.1%) |
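The same arithmetic as a small sketch, useful if your traffic shape differs from the 3k-in / 1k-out example; the rates are the ones listed above, and the call shape in the example is illustrative.

```python
# Per-call cost = (in_tokens * in_price + out_tokens * out_price) / 1e6,
# using the per-million-token rates listed above.
RATES = {  # (input $/M, output $/M)
    "DeepInfra": (0.6, 1.9),
    "GMI": (0.6, 2.0),
    "Parasail (FP8)": (0.6, 2.1),
    "Novita": (0.6, 2.2),
    "Baseten": (0.6, 2.2),
}

def cost_per_call(provider: str, in_tokens: int, out_tokens: int) -> float:
    in_price, out_price = RATES[provider]
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

# Reproduces the table's 3k/1k figure for DeepInfra: 0.00370
print(f"{cost_per_call('DeepInfra', 3_000, 1_000):.5f}")
```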
AA plots end-to-end seconds to stream 500 output tokens on the y-axis against the effective price ($ per 1M tokens) on the x-axis. E2E latency captures the whole user wait—TTFT + generation time—so it reflects both startup overhead and streaming throughput. The lower-left quadrant is the Pareto-efficient region: points there deliver the same work faster and cheaper. Providers that sit up-and-to-the-right are dominated (you’re paying more and waiting longer).
How to use the chart:
If your sole target is the fastest 500-token completion, Baseten leads near $1.00/M. If you want the best price/performance—sub-second TTFT, steady throughput, and the lowest $/M—DeepInfra (FP8) is the practical default; it’s cheaper, faster than GMI and Parasail in this view, and roughly on par with Novita. In all cases, pick the cheapest provider that clears your latency SLO.
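To apply the same dominated-vs-efficient reading to your own measurements, here is a minimal sketch; the provider names and numbers in the example call are placeholders, not AA values.

```python
# A point is Pareto-dominated if some other provider is both no more
# expensive and no slower, and strictly better on at least one axis.
def pareto_efficient(points: dict[str, tuple[float, float]]) -> set[str]:
    """points maps provider -> (price_per_M, e2e_latency_s)."""
    efficient = set()
    for name, (price, latency) in points.items():
        dominated = any(
            other != name
            and op <= price and ol <= latency
            and (op < price or ol < latency)
            for other, (op, ol) in points.items()
        )
        if not dominated:
            efficient.add(name)
    return efficient

# Placeholder numbers purely to show the mechanics.
print(pareto_efficient({"A": (0.9, 10.0), "B": (1.0, 8.0), "C": (1.1, 12.0)}))
```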
| Provider | TTFT @1k (s) | Input $/M | Output $/M | Notes |
|---|---|---|---|---|
| DeepInfra (FP8) | 0.51 | 0.6 | 1.9 | 100k throughput 48 t/s; median throughput 50.6 t/s |
| Baseten | 0.21 | 0.6 | 2.2 | Fastest TTFT; 100k throughput 97 t/s; median throughput 113.4 t/s |
| Novita | 0.77 | 0.6 | 2.2 | 100k throughput 59 t/s; median 52.2 t/s |
| GMI | 1.13 | 0.6 | 2.0 | 100k throughput 78 t/s; median 43.6 t/s |
| Parasail (FP8) | 0.57 | 0.6 | 2.1 | 100k throughput 28 t/s; median 38.5 t/s |
For real users, responsiveness and steadiness trump peak benchmarks. In IDE assistants and code review, DeepInfra (FP8) clears the interactivity bar with a 0.51 s TTFT and a 50.6 t/s median streaming rate—fast enough to feel instant and smooth—while its $1.9/M output price yields a lower unit cost. In multi-tool agent loops, step time is effectively TTFT + (tokens ÷ t/s); DeepInfra’s combination of quick first tokens and predictable throughput keeps cadence tight and lets you execute more steps per dollar.
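As a back-of-the-envelope check, that step-time relation can be plugged with figures from the summary table above; the 300-token step size below is an illustrative assumption, not an AA number.

```python
# Rough per-step latency for an agent loop: TTFT + output_tokens / throughput.
def step_seconds(ttft_s: float, out_tokens: int, tokens_per_s: float) -> float:
    return ttft_s + out_tokens / tokens_per_s

# DeepInfra (FP8) figures from the summary table: 0.51 s TTFT, 50.6 t/s median.
print(f"{step_seconds(0.51, 300, 50.6):.1f} s per ~300-token step")  # ≈ 6.4 s
```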
For long-context RAG, DeepInfra’s 48 t/s at 100k input tokens isn’t the raw-throughput leader, but the earlier first token versus high-throughput peers (e.g., GMI at 78 t/s) sustains an interactive feel in document-heavy sessions. If you can trade a modest throughput gap for better first-token latency and lower $/M, DeepInfra is the pragmatic default. On the FinOps side, a typical 3k-in / 1k-out call prices at $0.00370, 2.7–8.1% cheaper than the rest of this cohort—savings that compound materially at scale without sacrificing perceived speed. Net-net: unless you’re optimizing purely for maximum streaming rate regardless of cost, DeepInfra offers the best balance of latency, predictability, and price for GLM-4.6.
Disclaimer: This article reflects data and pricing as of October 15, 2025. Benchmarks and provider rates can change quickly, so please verify the linked sources and rerun representative evaluations before making decisions.