We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

DeepSeek V4 Flash vs Qwen3.6 vs GLM-4.6 Benchmarks
Published on 2026.07.01 by DeepInfra
DeepSeek V4 Flash vs Qwen3.6 vs GLM-4.6 Benchmarks

A breakdown of three open-weight models across intelligence, speed, and inference cost. 

Three open-weight models cover most of what a developer needs from open inference right now: DeepSeek V4 Flash, Qwen3.6 35B A3B, and GLM-4.6. All three run on DeepInfra, and all three use a Mixture-of-Experts design that keeps active parameters low while total capacity stays high. Past that shared backbone, they diverge. The teams behind them tuned each model for a different job.

DeepSeek V4 Flash runs cost-efficient reasoning at scale, carrying 284B total parameters but activating only 13B per token. Qwen3.6 35B A3B handles agentic coding and multimodal input, routing 3B of its 35B parameters per token across 256 experts. GLM-4.6 targets tool use and long-context retrieval, with a 200k token window built for agent loops.

This article compares the three across intelligence, coding, agent performance, speed, and inference cost on DeepInfra.

DeepSeek V4 vs. Qwen3.6 vs. GLM 4.6: At a Glance

DeepSeek V4 FlashQwen3.6 35B A3BGLM-4.6
ArchitectureHybrid CSA + HCA attention with MoEGated DeltaNet + Gated Attention + sparse MoE, 256 expertsMixture-of-Experts
Total Parameters284B35B357B
Active Parameters per Token13B3B32B
Context Window1M tokens262k (1M with YaRN)200k tokens
Multimodal InputText onlyText, image, videoText only
LicenseMITApache 2.0MIT
Designed ForCost-efficient reasoning at scaleAgentic coding and multimodal inputTool use, agents, long-context RAG
Intelligence Index (AA)373325
Input Price on DeepInfra$0.10$0.15$0.43
Output Price on DeepInfra$0.20$0.95$1.74
Cache Hit Price$0.02N/A$0.08
Speed on DeepInfra (tokens/sec)23 (high-effort reasoning mode)12139
Latency on DeepInfra (TTFT)0.93s0.7s1.0s
Ideal Use CasesLong-document reasoning on a budget, high-volume batchMultimodal agents, fast codingTool-heavy agents, long-context retrieval

What Each Model Is Designed To Do

DeepSeek V4 Flash: Cheap Reasoning at Scale

DeepSeek built V4 Flash to run reasoning cheaply. The model carries 284B total parameters but activates 13B per token, so you get a large knowledge base at the computational cost of a small model. It shipped April 24, 2026, under an MIT license and sits in the broader DeepSeek V4 line.

  • Cut the cost of reasoning inference at high volume
  • Hold a 1M token context for long documents
  • Keep active parameters low through MoE routing
  • Stay open under MIT with no commercial limits

It fits teams running reasoning workloads where the token bill, not peak quality, sets the budget

Qwen3.6 35B A3B: A Small Multimodal Agent Model

The Qwen team built 3.6 35B A3B for agents who see. It takes text, image, and video input, runs agentic coding, and activates 3B of 35B parameters per token. It was released in April 2026 under Apache 2.0.

  • Run agentic coding with a small active footprint
  • Accept image and video input alongside text
  • Scale context to 1M tokens with YaRN
  • Route across 256 experts, 8 routed plus 1 shared, per token

It suits developers building multimodal agents who want one model for vision and code.

GLM-4.6: Built for Tools and Long Context

Z AI built GLM-4.6 for agents that call tools and read long inputs. It targets tool use, agent workflows, and long-context RAG, with a 200k token window. It launched in September 2025 as an open-weight release under the MIT license.

  • Drive multi-step agent loops with reliable tool calls
  • Hold 200k tokens for retrieval over long inputs
  • Improve on GLM-4.5 across coding and reasoning
  • Ship as open weights for self-hosting

It fits teams whose agents depend on tool use and large retrieval contexts.

The Difference in Architecture

DeepSeek V4 Flash

DeepSeek V4 Flash combines two attention mechanisms, CSA and HCA, on top of a Mixture-of-Experts backbone. The model holds 284B total parameters and routes 13B of them per token. Its context window reaches 1M tokens, and DeepInfra serves it in FP4.

You pay for 13B active parameters per token, not 284B. That math holds the input price at $0.10 on DeepInfra while the model still reasons over a 1M token window. For long-document reasoning on a budget, those two numbers shape the decision more than peak quality does.

Qwen3.6 35B A3B

Qwen3.6 35B A3B stacks three pieces: Gated DeltaNet linear attention, Gated Attention, and a sparse MoE layer with 256 experts. Each token activates 8 routed experts plus 1 shared expert, which pulls 3B parameters from the 35B total. The native context runs to 262k tokens and stretches to 1M with YaRN. DeepInfra serves it in FP8.

Linear attention cuts the memory cost of long sequences, so the model holds context without the quadratic blowup of standard attention. It also takes image and video input, so you build multimodal agents on one model instead of bolting a vision model onto a text one. At 121 tokens per second on DeepInfra, it runs faster than the other two here.

GLM-4.6

GLM-4.6 from Z AI runs a Mixture-of-Experts design tuned for tool calling and retrieval over long inputs. Its context window holds 200k tokens. It improves on GLM-4.5 across four areas: context length, coding, reasoning, and tool use. DeepInfra serves it in FP4.

That tuning shows up in agent loops where the model calls a tool, reads the result, and picks the next step. A 200k window fits large retrieval contexts without chunking tricks. At $0.43 input and $1.74 output on DeepInfra, it costs more than the other two, so its value depends on how hard you lean on its tool-use and RAG strengths.

Benchmark Performance

Intelligence Index

The Intelligence Index from Artificial Analysis folds reasoning, knowledge, and math scores into one number.

Source

  • DeepSeek V4 Flash scores 37, the highest of the three, at a $0.10 input price on DeepInfra.
  • Qwen3.6 35B A3B scores 33 at a higher output price.
  • GLM-4.6 scores 25, also at a higher output price.

These scores are Artificial Analysis estimates. Independent evaluations for all three models are still pending.

For index points per dollar, DeepSeek V4 Flash is the clear winner. Move to Qwen or GLM only if you need their specific capabilities.

Coding Performance

All three models perform well on coding tasks, but reach their scores differently. On SciCode, which measures scientific coding quality, DeepSeek leads at 42%, GLM follows at 38%, and Qwen sits at 36%. On AA-LCR, which measures reasoning over long inputs and is directly relevant to working across large codebases, Qwen takes the lead at 64%, DeepSeek follows at 63%, and GLM sits at 54%. 

   Source

  • DeepSeek leads on raw coding quality, but Qwen pulls ahead when the task requires reasoning across a long context, which is where most real repository work happens.
  • GLM scores below both on long context reasoning despite its 200k token window. Window size alone does not close the gap.

For short, focused coding tasks, DeepSeek is the default. For agentic coding across large codebases, Qwen’s long context reasoning advantage matters.

Agent Workflows

Terminal-Bench Hard measures agentic coding and real terminal use, the kind of multi-step tasks an agent actually runs. DeepSeek leads at 39%, Qwen follows at 35%, and GLM trails at 25%. 

     Source

  • DeepSeek and Qwen are closer here than GLM, which scores 10 points below Qwen despite tool use being a core GLM-4.6 design goal.
  • GLM’s agent strength shows more in tool calling and retrieval loops than in terminal-based agentic coding tasks.

For terminal-driven agent workflows, DeepSeek and Qwen are the stronger picks. Test GLM on your specific tool stack before ruling it out for retrieval-heavy agent patterns.

Speed and Latency on DeepInfra

Qwen3.6 35B A3B generates at 121 tokens per second and hits the first token in 0.7s. That speed comes from its 3B active parameters, which is the smallest active count among the three models. This efficiency showcases how sparse architecture maintains high throughput during long sequence tasks. GLM-4.6 runs at 39 tokens per second with a 1.0s TTFT. DeepSeek V4 Flash is the slowest of the three at 23 tokens per second and 0.93s TTFT, since high-effort reasoning burns most of its time on thinking before generating output. That 23 tokens per second reflects DeepSeek running in high-effort reasoning mode. In non-reasoning mode, its throughput is significantly higher. 

Source

Latency tells a similar story. Qwen reaches the first token fastest at 0.7s, DeepSeek follows at 0.93s, and GLM trails at 1.0s. The gap is small in absolute terms, but compounds across high-frequency agent calls.

Source

Qwen wins on raw speed by a wide margin, more than 3x GLM and 5x DeepSeek. If your users watch streamed output, Qwen’s 121 tokens per second reads noticeably faster on screen. If you run batch jobs where the bill matters more than wall-clock time, DeepSeek’s $0.10 input price changes the calculation. 

Pricing on DeepInfra

DeepSeek V4 Flash costs $0.10 per 1M input tokens and $0.20 per 1M output on DeepInfra, with cache hits dropping to $0.02. Qwen3.6 35B A3B runs $0.15 input and $0.95 output. GLM-4.6 sits at $0.43 input and $1.74 output, the highest of the three, with cache hits at $0.08.

Source

Output tokens drive the bill in reasoning workloads, and that gap is significant. GLM’s $1.74 output rate runs almost 9x DeepSeek’s $0.20, with Qwen in the middle at $0.95. A reasoning model emits long chains of thought, so output price compounds across every request. DeepSeek’s $0.20 output rate plus $0.02 cache hits make it the cheapest to run at volume. You pay Qwen’s output premium for its speed and multimodal input. You pay GLM’s for tool use and retrieval quality. The benchmarks here do not fully capture structured tool calling and search-agent loops, which is where GLM performs best. Test it on your own stack before ruling it out.

Cost-Performance Tradeoffs

Choose DeepSeek V4 Flash if you want:

  • The lowest inference cost here is $0.10 input and $0.20 output, with $0.02 cache hits
  • A 1M token context for long-document reasoning
  • High-effort reasoning where price beats tokens per second
  • An MIT license with no usage restrictions

Choose Qwen3.6 35B A3B if you want:

  • The fastest output here is 121 tokens per second on DeepInfra
  • Image and video input on one model
  • Agentic coding with a 3B active footprint
  • Apache 2.0 licensing and context up to 1M with YaRN

Choose GLM-4.6 if you want:

  • Tool use and agent loops tuned past GLM-4.5
  • A 200k context window for long-context RAG
  • Stronger coding and reasoning than the prior GLM release
  • Open weights for self-host flexibility

Final Thoughts

Pick based on what your workload actually needs. DeepSeek V4 Flash is the default for cost-sensitive reasoning at scale. Qwen3.6 35B A3B is the pick for speed and multimodal input. GLM-4.6 earns its higher price only inside tool-heavy agent loops and large retrieval pipelines, so test it on your own stack before committing.

All three run on DeepInfra under permissive licenses. The best one is whichever fits your budget and use case, not whichever scores highest on a benchmark.

Related articles
Function Calling in DeepInfra: Extend Your AI with Real-World LogicFunction Calling in DeepInfra: Extend Your AI with Real-World Logic<p>Modern large language models (LLMs) are incredibly powerful at understanding and generating text, but until recently they were largely static: they could only respond based on patterns in their training data. Function calling changes that. It lets language models interact with external logic — your own code, APIs, utilities, or business systems — while still [&hellip;]</p>
Qwen3.5 27B API Benchmarks: Latency, Throughput & CostQwen3.5 27B API Benchmarks: Latency, Throughput & Cost<p>About Qwen3.5 27B (Reasoning) Qwen3.5 27B is part of Alibaba Cloud&#8217;s latest-generation foundation model family, released in February 2026. Unlike the Mixture-of-Experts variants in the Qwen3.5 series, the 27B model uses a dense architecture combining Gated Delta Networks and Feed Forward Networks. It achieves strong benchmark scores including MMLU-Pro (86.1%), GPQA Diamond (85.5%), and SWE-bench [&hellip;]</p>
Best SaaS Platforms for Deploying Gemma 4 in 2026Best SaaS Platforms for Deploying Gemma 4 in 2026<p>Gemma 4 is available across a range of platforms — from fully managed API providers to local runners and no-code builders. The right choice depends on what you&#8217;re optimizing for: cost, latency, data privacy, local execution, or zero infrastructure overhead. This guide breaks down the top options by use case so you can match the [&hellip;]</p>