

GLM-4.6 API: Get fast first tokens at the best $/M from DeepInfra's API
Published on 2025.12.01 by DeepInfra

GLM-4.6 is a high-capacity, “reasoning”-tuned model that shows up in coding copilots, long-context RAG, and multi-tool agent loops. With this class of workload, provider infrastructure determines perceived speed (first-token time), tail stability, and your unit economics. Using ArtificialAnalysis (AA) provider charts for GLM-4.6 (Reasoning), DeepInfra (FP8) pairs a sub-second Time-to-First-Token (TTFT) (0.51 s) with the lowest output price ($1.9/M) among the cohort, while keeping competitive throughput at 100k-token prompts. Baseten wins a few raw-speed trophies (fastest TTFT and the highest t/s), but charges more per output token.


GLM-4.6 Model in Context

GLM-4.6 is Zhipu’s latest general-purpose reasoning model. AA evaluates it with input sizes up to 100k tokens, which makes the provider’s long-context behavior particularly relevant for RAG over large docs/repos. The model slots well into:

  • IDE assistants & code review: interactive streaming and short agent steps; TTFT and throughput stability matter more than peak tokens/sec.
  • RAG and repo analysis: 100k-token prompts are realistic; you care about how TTFT/throughput scales with context.
  • Agent loops: multiple chained calls; p95/p99 latency and jitter dominate UX.

What GLM-4.6 doesn’t provide by itself: browsing, code execution, or file I/O—those depend on your orchestration layer. When sizing your stack, pick it when you need stronger reasoning and very long contexts; otherwise, a cheaper or smaller model may be sufficient for bulk chat or classification.

Compared to its predecessor GLM-4.5, the newer version brings several key improvements: 

  • Longer context (↑56%) — context window increases from 128K → 200K tokens, enabling larger RAG prompts, multi-file code reviews, and deeper agent traces without chunking.
  • Stronger coding — higher scores on public code evals and better behavior in practical IDE flows (e.g., Claude Code, Cline, Roo Code, Kilo Code), with noticeably cleaner front-end scaffolds and UI markup. (Exact benchmark deltas not reported.)
  • Reasoning + tool use — more reliable multi-step reasoning and native support for tool invocation during inference, improving success rates on retrieval/execute/verify loops.
  • Agent readiness — better performance in search- and tool-driven agents and smoother integration with common agent frameworks (fewer retries, more consistent step timing).
  • Refined writing — improved preference alignment for tone and structure; more natural persona/role-play outputs and clearer, easier-to-edit prose.

API Speed & Predictability

Time to First Token (TTFT)

TTFT is the elapsed time from request to the first streamed token. For chat UIs, IDEs, and step-wise agents, it’s the metric users feel.
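
If you want to verify TTFT against your own prompts, a minimal sketch is below: it streams a request through DeepInfra's OpenAI-compatible endpoint and times the first visible content token. The model identifier and prompt are assumptions—substitute the exact model ID from the model page.

```python
import os
import time

from openai import OpenAI  # pip install openai

# DeepInfra exposes an OpenAI-compatible API; the model ID below is an
# assumption -- substitute the exact identifier from the model page.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

def measure_ttft(prompt: str, model: str = "zai-org/GLM-4.6") -> float:
    """Seconds from sending the request to the first streamed content token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=64,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")  # stream ended without any content

if __name__ == "__main__":
    print(f"TTFT: {measure_ttft('Explain idempotency in one sentence.'):.2f} s")
```

Because this is a reasoning-tuned model, the first visible token may arrive only after internal reasoning, so run several trials and compare medians rather than single samples.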

[Chart: Time to First Token (TTFT) comparison across providers]

API ranking & deltas vs DeepInfra (0.51 s):

  • Baseten: 58.8% faster (0.21 s vs 0.51 s).
  • Parasail (FP8): +11.8% slower (0.57 s).
  • Novita: +51.0% slower (0.77 s).
  • GMI: +121.6% slower (1.13 s).

DeepInfra sits in the sub-second group and is second-fastest here, within ~300 ms of the leader. Baseten is the right pick when you absolutely need the earliest possible first token. Everyone else pays a noticeable TTFT tax relative to DeepInfra.


Predictability (variance)

Predictability is about how steady the stream is, not just how fast it starts. The variance plot for output speed (tokens/sec) illustrates this directly: the median indicates typical throughput, the IQR (box) captures routine jitter, and the whiskers/tails highlight rare slowdowns that exceed your SLOs. Volatility in tokens per second inflates p95/p99 latency even when TTFT is fine—users feel that as pauses, tool timeouts, or retries.

How to read it:

  • Tight box + short upper whisker → stable streaming, predictable E2E, smoother agent cadence.
  • Wide box or long whiskers → throughput swings; expect spiky p95/p99, backoffs, and higher effective cost per action.
  • Higher median t/s at the same spread → faster completions without sacrificing reliability.

Pick providers that combine high median t/s with compact spread; they’ll miss latency targets less often and keep agents moving consistently.
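
To quantify spread from your own traffic, collect per-request throughput samples and summarize the median, IQR, and slow tail; you can also derive p95 end-to-end latency for a fixed completion length. A minimal sketch with hypothetical sample values:

```python
import statistics

def throughput_spread(tps_samples: list[float]) -> dict:
    """Median, IQR, and slow tail (p5) of per-request tokens/sec samples."""
    s = sorted(tps_samples)
    q1, med, q3 = statistics.quantiles(s, n=4)
    p5 = s[max(0, int(0.05 * len(s)) - 1)]  # slowest ~5% of requests
    return {"median_tps": med, "iqr": q3 - q1, "p5_tps": p5}

def e2e_p95(ttft_s: float, tps_samples: list[float], out_tokens: int = 500) -> float:
    """p95 end-to-end seconds for a fixed completion length, derived from throughput jitter."""
    latencies = sorted(ttft_s + out_tokens / tps for tps in tps_samples)
    return latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]

# Hypothetical per-request samples from one provider:
samples = [48.2, 50.1, 51.7, 49.3, 44.0, 52.5, 50.9, 47.8, 36.5, 51.0]
print(throughput_spread(samples))
print(f"p95 E2E for 500 tokens: {e2e_p95(0.51, samples):.1f} s")
```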

[Chart: Output speed (tokens/sec) variance by provider]

Baseten clearly leads on median tokens/sec, but DeepInfra and Novita form a stable middle that’s ~30–35% faster than Parasail’s median and ~16–20% faster than GMI’s. In practice, that means fewer “stalls” during streaming and tighter loop times in agent chains.


Scaling with context (100 / 1k / 10k / 100k input tokens)

AA reports output speed (t/s) across increasing input sizes. For long-context RAG and repo analysis, the 100k column is the differentiator. The chart below shows how much throughput declines as the input context grows.

[Chart: Output speed by input token count (100 / 1k / 10k / 100k)]

What it means:

  • Baseten dominates throughput across sizes and maintains 97 t/s even at 100k input tokens.
  • DeepInfra keeps a solid 48 t/s at 100k, which is +71% faster than Parasail (28 t/s), but behind GMI (78 t/s) and Baseten (97 t/s).
  • Novita lands in the middle (59 t/s at 100k).

If your product routinely sends 100k-token prompts (large PDFs, monorepo sweeps), Baseten’s raw throughput minimizes streaming time; DeepInfra, however, combines fast TTFT with middle-to-high throughput and the lowest price—often the better speed-per-dollar choice.
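
A rough way to reproduce the context-scaling curve for your own workload is to pad prompts to each target size and time the streamed output. The sketch below is an approximation: the padding only roughly hits the token targets, chunk counts stand in for exact token counts, and the model ID is assumed.

```python
import os
import time

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

def output_tps(prompt: str, model: str = "zai-org/GLM-4.6", max_tokens: int = 500) -> float:
    """Streaming tokens/sec measured after the first token arrives."""
    first, count = None, 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=max_tokens,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            count += 1  # chunks approximate tokens; use the response usage field for exact counts
            if first is None:
                first = time.perf_counter()
    return count / (time.perf_counter() - first) if first else 0.0

filler = "lorem ipsum dolor sit amet " * 4  # roughly 24 tokens per repeat
for target in (100, 1_000, 10_000, 100_000):
    prompt = filler * max(1, target // 24) + "\nSummarize the text above in two sentences."
    print(f"{target:>7} input tokens (approx): {output_tps(prompt):.1f} t/s")
```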


GLM 4.6 API Pricing

Why $/M interacts with performance

Two providers separated by $0.3/M may look close, but at scale (hundreds of millions of tokens) it’s real money. And when you pair a cheaper provider with faster TTFT, you enable tighter agent loops—users send more prompts per minute without feeling lag.

Input/Output prices (AA)

AA shows identical input price for all five providers and different output prices:

  • Input ($/M): $0.6/M for all (DeepInfra, GMI, Parasail, Novita, Baseten).
  • Output ($/M): DeepInfra $1.9/M, GMI $2.0/M, Parasail $2.1/M, Novita $2.2/M, Baseten $2.2/M.
[Chart: GLM-4.6 API pricing comparison]

Bottom line: DeepInfra is cheapest on output at $1.9/M—a 5–14% advantage over peers—while inputs are a wash at $0.6/M.

Cost example — 3,000 input / 1,000 output tokens

Use the asymmetric formula (prices are $ per million tokens):

cost = (input_tokens ÷ 1,000,000) × input $/M + (output_tokens ÷ 1,000,000) × output $/M

Plugging in 3,000 input and 1,000 output tokens gives these prices for the different providers:

Provider          Output $/M   Total cost for 3k/1k
DeepInfra         1.9          $0.00370
GMI               2.0          $0.00380 (+2.7% vs DeepInfra)
Parasail (FP8)    2.1          $0.00390 (+5.4%)
Novita            2.2          $0.00400 (+8.1%)
Baseten           2.2          $0.00400 (+8.1%)
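
The same arithmetic in a few lines of Python, using the AA prices quoted above:

```python
def call_cost(in_tokens: int, out_tokens: int, in_price: float, out_price: float) -> float:
    """Per-call cost in dollars; prices are $ per million tokens."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

# Output $/M per provider from the AA snapshot above; input is $0.6/M everywhere.
output_prices = {"DeepInfra": 1.9, "GMI": 2.0, "Parasail": 2.1, "Novita": 2.2, "Baseten": 2.2}
for name, out_price in output_prices.items():
    print(f"{name:<10} ${call_cost(3_000, 1_000, 0.6, out_price):.5f} per call")
```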

End-to-End response time vs price

AA plots end-to-end seconds to stream 500 output tokens on the y-axis against the effective price ($ per 1M tokens) on the x-axis. E2E latency captures the whole user wait—TTFT + generation time—so it reflects both startup overhead and streaming throughput. The lower-left quadrant is the Pareto-efficient region: points there deliver the same work faster and cheaper. Providers that sit up-and-to-the-right are dominated (you’re paying more and waiting longer).

How to use the chart:

  • Have a latency SLO? Draw a horizontal line at your target (e.g., 10–12 s). Among providers below that line, pick the leftmost one to minimize cost.
  • Have a budget cap? Draw a vertical line at your $/M limit and choose the lowest point to minimize user wait.
  • Compare near-neighbors. When two points cluster, prefer the one slightly lower (latency wins) unless the other is meaningfully further left (material cost savings at similar speed).
  • Baseten lands lowest on E2E (fastest full response) around the $1.00/M price mark, clearly inside the attractive quadrant.
  • DeepInfra (FP8) sits near 50 s on E2E at the left edge of the cluster (≈$0.93/M on AA’s axis), faster than GMI and Parasail and roughly similar to Novita in this plot.

If your sole target is the fastest 500-token completion, Baseten leads near $1.00/M. If you want the best price/performance—sub-second TTFT, steady throughput, and the lowest $/M—DeepInfra (FP8) is the practical default; it’s cheaper, faster than GMI and Parasail in this view, and roughly on par with Novita. In all cases, pick the cheapest provider that clears your latency SLO.
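
The “cheapest provider that clears your latency SLO” rule is straightforward to encode; the values below are hypothetical placeholders for the E2E and blended-price readings you would take off the AA chart yourself:

```python
def cheapest_within_slo(providers: dict[str, tuple[float, float]], slo_s: float) -> str | None:
    """providers maps name -> (e2e_seconds, blended $/M); returns the cheapest that meets the SLO."""
    eligible = [(price, name) for name, (e2e, price) in providers.items() if e2e <= slo_s]
    return min(eligible)[1] if eligible else None

# Hypothetical readings, not AA data:
snapshot = {"provider_a": (9.5, 1.00), "provider_b": (12.0, 0.93), "provider_c": (14.0, 0.95)}
print(cheapest_within_slo(snapshot, slo_s=12.0))  # -> provider_b
```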


Comparison table (one glance)

Provider          TTFT @1k (s)   Input $/M   Output $/M   Notes
DeepInfra (FP8)   0.51           0.6         1.9          100k throughput 48 t/s; median throughput 50.6 t/s
Baseten           0.21           0.6         2.2          Fastest TTFT; 100k throughput 97 t/s; median throughput 113.4 t/s
Novita            0.77           0.6         2.2          100k throughput 59 t/s; median 52.2 t/s
GMI               1.13           0.6         2.0          100k throughput 78 t/s; median 43.6 t/s
Parasail (FP8)    0.57           0.6         2.1          100k throughput 28 t/s; median 38.5 t/s

Key takeaways

  • DeepInfra balances speed + price best. With 0.51 s TTFT and the cheapest $1.9/M output price, DeepInfra minimizes both time-to-interaction and cost per call. At 100k input tokens, it still pushes 48 t/s, which keeps long-document sessions usable.
  • Baseten is the raw-speed leader. It posts the fastest TTFT (0.21 s) and the highest throughput across contexts (up to 97 t/s at 100k), but it costs $2.2/M on output—~15% above DeepInfra. When every 100–300 ms of perceived speed matters and budget is flexible, this is the premium lane. 
  • Novita is a steady middle-ground. Its TTFT (0.77 s) trails DeepInfra, but throughput variance and 100k throughput (59 t/s) are solid. Price is on the higher side ($2.2/M output).
  • GMI is slower on TTFT but quick in a long context. At 1.13 s TTFT, it trails DeepInfra by +122%, yet it delivers 78 t/s at 100k (second only to Baseten). If your workload is dominated by very long prompts with less sensitivity to first-token latency, GMI can do well—at a slight price premium to DeepInfra.
  • Parasail is price-adjacent but slower. TTFT 0.57 s is near DeepInfra, but throughput is lowest at 100k (28 t/s) and output price is higher ($2.1/M). 
  • E2E vs price echoes the pattern. Baseten wins raw E2E; DeepInfra is at the left edge of the cluster with roughly mid-pack E2E—meaning good perceived speed at the lowest cost. 

Why DeepInfra wins for GLM-4.6

For real users, responsiveness and steadiness trump peak benchmarks. In IDE assistants and code review, DeepInfra (FP8) clears the interactivity bar with a 0.51 s TTFT and a 50.6 t/s median streaming rate—fast enough to feel instant and smooth—while its $1.9/M output price yields a lower unit cost. In multi-tool agent loops, step time is effectively TTFT + (tokens ÷ t/s); DeepInfra’s combination of quick first tokens and predictable throughput keeps cadence tight and lets you execute more steps per dollar.
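
As a rough worked example from the comparison table: a 1,000-token agent step on DeepInfra streams in about 0.51 s + 1,000 ÷ 50.6 t/s ≈ 20.3 s and, with 3,000 tokens of context, costs about $0.0037, versus roughly $0.0040 for the same step on the fastest-TTFT provider.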

For long-context RAG, DeepInfra’s 48 t/s at 100k input tokens isn’t the raw-throughput leader, but the earlier first token versus high-throughput peers (e.g., GMI at 78 t/s) sustains an interactive feel in document-heavy sessions. If you can trade a modest throughput gap for better first-token latency and lower $/M, DeepInfra is the pragmatic default. On the FinOps side, a typical 3k-in / 1k-out call prices at $0.00370, 2.7–8.1% cheaper than the rest of this cohort—savings that compound materially at scale without sacrificing perceived speed. Net-net: unless you’re optimizing purely for maximum streaming rate regardless of cost, DeepInfra offers the best balance of latency, predictability, and price for GLM-4.6.

Disclaimer: This article reflects data and pricing as of October 15, 2025. Benchmarks and provider rates can change quickly, so please verify the linked sources and rerun representative evaluations before making decisions.
