We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

DeepSeek V4 Flash vs Qwen3.6 vs GLM-4.6 Benchmarks

Published on 2026.07.01 by DeepInfra

A breakdown of three open-weight models across intelligence, speed, and inference cost.

Three open-weight models cover most of what a developer needs from open inference right now: DeepSeek V4 Flash, Qwen3.6 35B A3B, and GLM-4.6. All three run on DeepInfra, and all three use a Mixture-of-Experts design that keeps active parameters low while total capacity stays high. Past that shared backbone, they diverge. The teams behind them tuned each model for a different job.

DeepSeek V4 Flash runs cost-efficient reasoning at scale, carrying 284B total parameters but activating only 13B per token. Qwen3.6 35B A3B handles agentic coding and multimodal input, routing 3B of its 35B parameters per token across 256 experts. GLM-4.6 targets tool use and long-context retrieval, with a 200k token window built for agent loops.

This article compares the three across intelligence, coding, agent performance, speed, and inference cost on DeepInfra.

DeepSeek V4 vs. Qwen3.6 vs. GLM 4.6: At a Glance

	DeepSeek V4 Flash	Qwen3.6 35B A3B	GLM-4.6
Architecture	Hybrid CSA + HCA attention with MoE	Gated DeltaNet + Gated Attention + sparse MoE, 256 experts	Mixture-of-Experts
Total Parameters	284B	35B	357B
Active Parameters per Token	13B	3B	32B
Context Window	1M tokens	262k (1M with YaRN)	200k tokens
Multimodal Input	Text only	Text, image, video	Text only
License	MIT	Apache 2.0	MIT
Designed For	Cost-efficient reasoning at scale	Agentic coding and multimodal input	Tool use, agents, long-context RAG
Intelligence Index (AA)	37	33	25
Input Price on DeepInfra	$0.10	$0.15	$0.43
Output Price on DeepInfra	$0.20	$0.95	$1.74
Cache Hit Price	$0.02	N/A	$0.08
Speed on DeepInfra (tokens/sec)	23 (high-effort reasoning mode)	121	39
Latency on DeepInfra (TTFT)	0.93s	0.7s	1.0s
Ideal Use Cases	Long-document reasoning on a budget, high-volume batch	Multimodal agents, fast coding	Tool-heavy agents, long-context retrieval

What Each Model Is Designed To Do

DeepSeek V4 Flash: Cheap Reasoning at Scale

DeepSeek built V4 Flash to run reasoning cheaply. The model carries 284B total parameters but activates 13B per token, so you get a large knowledge base at the computational cost of a small model. It shipped April 24, 2026, under an MIT license and sits in the broader DeepSeek V4 line.

Cut the cost of reasoning inference at high volume
Hold a 1M token context for long documents
Keep active parameters low through MoE routing
Stay open under MIT with no commercial limits

It fits teams running reasoning workloads where the token bill, not peak quality, sets the budget

Qwen3.6 35B A3B: A Small Multimodal Agent Model

The Qwen team built 3.6 35B A3B for agents who see. It takes text, image, and video input, runs agentic coding, and activates 3B of 35B parameters per token. It was released in April 2026 under Apache 2.0.

Run agentic coding with a small active footprint
Accept image and video input alongside text
Scale context to 1M tokens with YaRN
Route across 256 experts, 8 routed plus 1 shared, per token

It suits developers building multimodal agents who want one model for vision and code.

GLM-4.6: Built for Tools and Long Context

Z AI built GLM-4.6 for agents that call tools and read long inputs. It targets tool use, agent workflows, and long-context RAG, with a 200k token window. It launched in September 2025 as an open-weight release under the MIT license.

Drive multi-step agent loops with reliable tool calls
Hold 200k tokens for retrieval over long inputs
Improve on GLM-4.5 across coding and reasoning
Ship as open weights for self-hosting

It fits teams whose agents depend on tool use and large retrieval contexts.

The Difference in Architecture

DeepSeek V4 Flash

DeepSeek V4 Flash combines two attention mechanisms, CSA and HCA, on top of a Mixture-of-Experts backbone. The model holds 284B total parameters and routes 13B of them per token. Its context window reaches 1M tokens, and DeepInfra serves it in FP4.

You pay for 13B active parameters per token, not 284B. That math holds the input price at $0.10 on DeepInfra while the model still reasons over a 1M token window. For long-document reasoning on a budget, those two numbers shape the decision more than peak quality does.

Qwen3.6 35B A3B

Qwen3.6 35B A3B stacks three pieces: Gated DeltaNet linear attention, Gated Attention, and a sparse MoE layer with 256 experts. Each token activates 8 routed experts plus 1 shared expert, which pulls 3B parameters from the 35B total. The native context runs to 262k tokens and stretches to 1M with YaRN. DeepInfra serves it in FP8.

Linear attention cuts the memory cost of long sequences, so the model holds context without the quadratic blowup of standard attention. It also takes image and video input, so you build multimodal agents on one model instead of bolting a vision model onto a text one. At 121 tokens per second on DeepInfra, it runs faster than the other two here.

GLM-4.6

GLM-4.6 from Z AI runs a Mixture-of-Experts design tuned for tool calling and retrieval over long inputs. Its context window holds 200k tokens. It improves on GLM-4.5 across four areas: context length, coding, reasoning, and tool use. DeepInfra serves it in FP4.

That tuning shows up in agent loops where the model calls a tool, reads the result, and picks the next step. A 200k window fits large retrieval contexts without chunking tricks. At $0.43 input and $1.74 output on DeepInfra, it costs more than the other two, so its value depends on how hard you lean on its tool-use and RAG strengths.

Benchmark Performance

Intelligence Index

The Intelligence Index from Artificial Analysis folds reasoning, knowledge, and math scores into one number.

Source

DeepSeek V4 Flash scores 37, the highest of the three, at a $0.10 input price on DeepInfra.
Qwen3.6 35B A3B scores 33 at a higher output price.
GLM-4.6 scores 25, also at a higher output price.

These scores are Artificial Analysis estimates. Independent evaluations for all three models are still pending.

For index points per dollar, DeepSeek V4 Flash is the clear winner. Move to Qwen or GLM only if you need their specific capabilities.

Coding Performance

All three models perform well on coding tasks, but reach their scores differently. On SciCode, which measures scientific coding quality, DeepSeek leads at 42%, GLM follows at 38%, and Qwen sits at 36%. On AA-LCR, which measures reasoning over long inputs and is directly relevant to working across large codebases, Qwen takes the lead at 64%, DeepSeek follows at 63%, and GLM sits at 54%.

Source

DeepSeek leads on raw coding quality, but Qwen pulls ahead when the task requires reasoning across a long context, which is where most real repository work happens.
GLM scores below both on long context reasoning despite its 200k token window. Window size alone does not close the gap.

For short, focused coding tasks, DeepSeek is the default. For agentic coding across large codebases, Qwen’s long context reasoning advantage matters.

Agent Workflows

Terminal-Bench Hard measures agentic coding and real terminal use, the kind of multi-step tasks an agent actually runs. DeepSeek leads at 39%, Qwen follows at 35%, and GLM trails at 25%.

Source

DeepSeek and Qwen are closer here than GLM, which scores 10 points below Qwen despite tool use being a core GLM-4.6 design goal.
GLM’s agent strength shows more in tool calling and retrieval loops than in terminal-based agentic coding tasks.

For terminal-driven agent workflows, DeepSeek and Qwen are the stronger picks. Test GLM on your specific tool stack before ruling it out for retrieval-heavy agent patterns.

Speed and Latency on DeepInfra

Qwen3.6 35B A3B generates at 121 tokens per second and hits the first token in 0.7s. That speed comes from its 3B active parameters, which is the smallest active count among the three models. This efficiency showcases how sparse architecture maintains high throughput during long sequence tasks. GLM-4.6 runs at 39 tokens per second with a 1.0s TTFT. DeepSeek V4 Flash is the slowest of the three at 23 tokens per second and 0.93s TTFT, since high-effort reasoning burns most of its time on thinking before generating output. That 23 tokens per second reflects DeepSeek running in high-effort reasoning mode. In non-reasoning mode, its throughput is significantly higher.

Source

Latency tells a similar story. Qwen reaches the first token fastest at 0.7s, DeepSeek follows at 0.93s, and GLM trails at 1.0s. The gap is small in absolute terms, but compounds across high-frequency agent calls.

Source

Qwen wins on raw speed by a wide margin, more than 3x GLM and 5x DeepSeek. If your users watch streamed output, Qwen’s 121 tokens per second reads noticeably faster on screen. If you run batch jobs where the bill matters more than wall-clock time, DeepSeek’s $0.10 input price changes the calculation.

Pricing on DeepInfra

DeepSeek V4 Flash costs $0.10 per 1M input tokens and $0.20 per 1M output on DeepInfra, with cache hits dropping to $0.02. Qwen3.6 35B A3B runs $0.15 input and $0.95 output. GLM-4.6 sits at $0.43 input and $1.74 output, the highest of the three, with cache hits at $0.08.

Source

Output tokens drive the bill in reasoning workloads, and that gap is significant. GLM’s $1.74 output rate runs almost 9x DeepSeek’s $0.20, with Qwen in the middle at $0.95. A reasoning model emits long chains of thought, so output price compounds across every request. DeepSeek’s $0.20 output rate plus $0.02 cache hits make it the cheapest to run at volume. You pay Qwen’s output premium for its speed and multimodal input. You pay GLM’s for tool use and retrieval quality. The benchmarks here do not fully capture structured tool calling and search-agent loops, which is where GLM performs best. Test it on your own stack before ruling it out.

Cost-Performance Tradeoffs

Choose DeepSeek V4 Flash if you want:

The lowest inference cost here is $0.10 input and $0.20 output, with $0.02 cache hits
A 1M token context for long-document reasoning
High-effort reasoning where price beats tokens per second
An MIT license with no usage restrictions

Choose Qwen3.6 35B A3B if you want:

The fastest output here is 121 tokens per second on DeepInfra
Image and video input on one model
Agentic coding with a 3B active footprint
Apache 2.0 licensing and context up to 1M with YaRN

Choose GLM-4.6 if you want:

Tool use and agent loops tuned past GLM-4.5
A 200k context window for long-context RAG
Stronger coding and reasoning than the prior GLM release
Open weights for self-host flexibility

Final Thoughts

Pick based on what your workload actually needs. DeepSeek V4 Flash is the default for cost-sensitive reasoning at scale. Qwen3.6 35B A3B is the pick for speed and multimodal input. GLM-4.6 earns its higher price only inside tool-heavy agent loops and large retrieval pipelines, so test it on your own stack before committing.

All three run on DeepInfra under permissive licenses. The best one is whichever fits your budget and use case, not whichever scores highest on a benchmark.

Kimi K2 0905 API from Deepinfra: Practical Speed, Predictable Costs, Built for Devs - Deep InfraKimi K2 0905 is Moonshot’s long-context Mixture-of-Experts update designed for agentic and coding workflows. With a context window up to ~256K tokens, it can ingest large codebases, multi-file documents, or long conversations and still deliver structured, high-quality outputs. But real-world performance isn’t defined by the model alone—it’s determined by the inference provider that serves it: […]

Reliable JSON-Only Responses with DeepInfra LLMsWhen large language models are used inside real applications, their role changes fundamentally. Instead of chatting with users, they become infrastructure components: extracting information, transforming text, driving workflows, or powering APIs. In these scenarios, natural language is no longer the desired output. What applications need is structured data — and very often, that structure is […]

Build a RAG App With DeepInfra and LangChainAsk a base language model about your company’s refund policy and it will answer with confidence, fluency, and no idea what your policy actually says. The facts live in your PDFs, your internal wiki, and your ticket history, none of which the model has ever seen during training. Retrieval-augmented generation closes that gap by fetching […]

View all