How Mixture of Experts Models Changed LLM Economics



Published on 2026.05.26 by DeepInfra

Every open-weight model that has closed the gap with GPT-5.5 and Claude Opus 4.7 this year has one thing in common. DeepSeek V4-Pro: 1.6 trillion parameters, 49 billion active per token. Kimi K2.6: 1 trillion parameters, 32 billion active. GLM-5.1: 744 billion parameters, 40 billion active. MiniMax M2.7: large total parameter count, 10 billion active per token. The architectural pattern is the same across all of them — Mixture of Experts — and understanding it explains why these models are simultaneously much larger and much cheaper to run than the dense models they are competing with.

This is an explanation of what Mixture of Experts actually is, why it matters for model performance, which models use it, and what the practical implications are for teams building with these models.

The Problem with Dense Models at Scale

To understand why MoE exists, start with how a standard transformer works. In a dense model, every token you pass in (each word fragment the tokenizer produces from your input) is processed by essentially every weight in the network. A 70B dense model uses roughly all 70 billion parameters for every token, on every forward pass.

This is not a bug. It is how the architecture works, and for models up to a certain size it is not a problem. But as you scale up, it becomes one. If you want more capacity in a dense model, you add parameters, and doubling the parameters roughly doubles the compute required for every forward pass. The cost of inference scales linearly with model size, and there is no escape from that linear scaling within a dense architecture.

The implication for frontier models is severe. A hypothetical dense model at 671 billion parameters — the equivalent total weight of DeepSeek-V3 — would require every token to pass through all 671 billion of those parameters at inference. At that scale, the compute cost makes the model economically nonviable for API-level serving at competitive prices. You either pay a lot per token, or you do not run the model at all.

Mixture of Experts solves this by breaking the relationship between total model capacity and per-token compute cost.

What Mixture of Experts Actually Is

The core idea is straightforward to describe, even if the engineering details are not. Instead of one monolithic network that handles everything, an MoE model replaces the feed-forward layer in some or all of its transformer blocks with a collection of smaller networks, the “experts,” plus a routing mechanism called a gating network. The attention layers stay shared. When a token arrives at an MoE layer, the router scores the experts against that token’s current representation and selects the few that score highest. Only those experts run. The rest stay dormant for that token.
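The routing step described above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the dimensions, the random router weights, and the "experts" that simply scale their input are all made up for the example, and renormalising the softmax over only the chosen experts is one common variant rather than the only one.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Renormalise over the chosen experts only (an assumed variant).
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, router_weights, experts, k=2):
    # Router: one logit per expert, a dot product with the token vector.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = route_top_k(logits, k)
    out = [0.0] * len(token)
    for idx, weight in chosen:
        expert_out = experts[idx](token)  # only the chosen experts run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, [idx for idx, _ in chosen]

random.seed(0)
DIM, N_EXPERTS = 4, 8  # hypothetical toy sizes
router_weights = [[random.gauss(0, 1) for _ in range(DIM)]
                  for _ in range(N_EXPERTS)]
# Toy "experts": each one just scales its input by a different factor.
experts = [lambda v, s=i + 1: [s * x for x in v] for i in range(N_EXPERTS)]

token = [0.5, -1.0, 0.25, 2.0]
output, used = moe_layer(token, router_weights, experts, k=2)
print("experts used for this token:", used)
```

Six of the eight toy experts are never called for this token, which is the whole point: their weights exist, but they cost nothing on this forward pass.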

In practice, modern MoE architectures activate a small fixed number of experts per token. DeepSeek-V3 uses 671 billion total parameters but activates only 37 billion per token — about 5.5%. Kimi K2.6 packs 1 trillion parameters and activates 32 billion per forward pass. MiniMax M2.7 activates just 10 billion parameters per token despite its much larger total capacity. The arithmetic this enables is the architecture’s key advantage: per-token compute is determined by the number of active parameters, not the total parameter count.

The routing mechanism — typically called the gating network or router — learns during training which experts are useful for which kinds of inputs. An expert that specializes in code generation activates more often when the input contains code. An expert that handles mathematical reasoning activates more often for quantitative problems. This specialization is not hand-designed; it emerges from training. The model learns to route inputs to the right experts because doing so produces better outputs.

One important clarification about the routing: modern MoE models do not route by “task type” at the coarse level of “this is a coding question.” The routing happens at the level of individual tokens within individual layers, and it is far more fine-grained than any human-readable categorization. What the literature calls “expert specialization” is a statistical tendency for certain experts to handle certain types of representations — not a hard partition by subject matter.

The Three Things MoE Changes

How much compute a forward pass requires. The most direct consequence of sparse activation is that the floating-point operations required per token are much lower than for a dense model of the same total size. DeepSeek-V3 with 37 billion active parameters has computational costs per token closer to a 37B dense model than a 671B one. Google’s GLaM research showed that a 1.2-trillion-parameter MoE with 64 experts per MoE layer, of which only two are active per token, achieved better average zero-shot performance than the dense 175B GPT-3 while using roughly half the inference FLOPs. At scale, MoE is not just more efficient: it can reach higher quality than a dense architecture at equal or lower compute.
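To put a rough number on the DeepSeek-V3 comparison above, here is a back-of-envelope sketch. It assumes the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass, which ignores attention over long contexts and other overheads, so treat the output as an order-of-magnitude estimate rather than a measurement.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Rule of thumb (an assumption): ~2 FLOPs per active parameter per token."""
    return 2.0 * active_params

moe_flops = forward_flops_per_token(37e9)     # DeepSeek-V3: 37B active
dense_flops = forward_flops_per_token(671e9)  # hypothetical dense model of the same total size

print(f"MoE:   {moe_flops / 1e9:,.0f} GFLOPs per token")
print(f"Dense: {dense_flops / 1e9:,.0f} GFLOPs per token")
print(f"Dense / MoE ratio: {dense_flops / moe_flops:.1f}x")
```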

How much capacity the model can develop at training time. Because only a fraction of parameters activate per token, MoE models can have far more total parameters than would be feasible in a dense architecture under the same training compute budget. More total parameters means more capacity for the model to store different types of knowledge and reasoning strategies. The model can “know more” without every forward pass paying for all of that knowledge. This is the key reason that the most capable open-weight models of 2026 are trillion-parameter MoE architectures: that scale would be economically impossible as dense models.

How the cost of additional capability scales. In a dense model, adding capability means adding parameters, which means adding compute to every forward pass. In an MoE model, you can add new experts — expanding total capacity — without increasing per-token compute, as long as you keep the number of active experts per token constant. This decoupling is what makes trillion-parameter models financially viable to serve via API at competitive prices.
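The decoupling is easy to see in a parameter count for a single MoE layer. The sketch below is illustrative only: the layer dimensions are hypothetical, each expert is modelled as a plain two-matrix feed-forward block (many real models use gated variants with a third matrix), and the router's own small weight matrix is left out.

```python
def moe_ffn_params(d_model: int, d_expert: int, n_experts: int, top_k: int):
    """Return (total, active) expert parameters for one MoE feed-forward layer."""
    per_expert = 2 * d_model * d_expert  # up-projection + down-projection
    return n_experts * per_expert, top_k * per_expert

# Hypothetical dimensions; only n_experts changes between rows.
for n_experts in (8, 64, 256):
    total, active = moe_ffn_params(d_model=4096, d_expert=2048,
                                   n_experts=n_experts, top_k=8)
    print(f"{n_experts:>3} experts: total={total / 1e9:.2f}B  "
          f"active per token={active / 1e9:.2f}B")
```

Total parameters grow with the expert count while the active column stays fixed, because it depends only on `top_k` and the size of one expert.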

The Catch: Memory Is Not Free

There is an important caveat that any honest explanation of MoE has to include. While MoE saves on computation per token, it does not save on memory. Routing is decided token by token and layer by layer at run time, so there is no way to know in advance which experts the next token will need, and any expert may be selected. For fast serving, this means all expert weights must be loaded into GPU memory before inference begins.

DeepSeek-V3 at 671 billion total parameters requires substantially more GPU memory than a 37B dense model, even though its per-token compute is comparable. Running V4-Pro’s 1.6 trillion parameters locally requires a serious GPU cluster. The MoE architecture is not magic: it shifts the constraint from compute to memory, which is a favorable trade for served API inference (where the memory cost is amortized across many requests) but a real challenge for local deployment.
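A rough estimate of the weight memory makes the gap concrete. The sketch below counts weights only and ignores the KV cache, activations, and framework overhead; the bytes-per-parameter values and the 80 GB GPU size are assumptions chosen for illustration, not a statement about how any provider deploys these models.

```python
import math

def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9

def min_gpus(weight_gb: float, gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on GPU count from weight memory alone (assumed 80 GB GPUs)."""
    return math.ceil(weight_gb / gpu_memory_gb)

for name, params in (("671B MoE (all experts resident)", 671e9),
                     ("37B dense", 37e9)):
    for label, nbytes in (("8-bit", 1), ("16-bit", 2)):
        gb = weight_memory_gb(params, nbytes)
        print(f"{name:<32} {label:>6}: {gb:>7,.0f} GB, at least {min_gpus(gb)} GPU(s)")
```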

There is also the routing overhead at large batch sizes. When many tokens are being processed simultaneously across many GPUs, the router’s decision about which expert to send each token to requires communication across devices. For high-batch-size serving — the normal case for a popular API — this “all-to-all” communication pattern can partially offset the per-token compute savings. The practical implication is that MoE’s efficiency advantage is most pronounced at low-to-moderate batch sizes and becomes more complicated at the extreme batch sizes of large-scale serving. Infrastructure providers running thousands of concurrent requests have to engineer carefully around this.
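A toy simulation shows why the communication adds up. Everything here is a stand-in: the router is replaced by uniform random choice, experts are split evenly across devices, and each token is assumed to start on a device picked round-robin. Real systems use learned routers and load-balancing tricks, so the numbers only illustrate the shape of the problem.

```python
import random
from collections import Counter

def simulate_dispatch(n_tokens, n_experts, n_devices, top_k, seed=0):
    """Count expert calls that must leave the token's home device."""
    if n_experts % n_devices != 0:
        raise ValueError("n_experts must be divisible by n_devices")
    rng = random.Random(seed)
    experts_per_device = n_experts // n_devices
    cross_device = 0
    load = Counter()
    for t in range(n_tokens):
        home = t % n_devices  # device that currently holds this token
        # Uniform random choice stands in for a learned router.
        for expert in rng.sample(range(n_experts), top_k):
            load[expert] += 1
            if expert // experts_per_device != home:
                cross_device += 1
    return cross_device, load

cross, load = simulate_dispatch(n_tokens=4096, n_experts=64, n_devices=8, top_k=2)
total_calls = sum(load.values())
print(f"{cross} of {total_calls} expert calls cross a device boundary")
print(f"busiest expert: {max(load.values())} tokens, "
      f"quietest: {min(load.values())} tokens")
```

With experts spread over eight devices and no locality in the routing, most expert calls land on a different device from the token, and the per-expert load is uneven even under uniform routing.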

Which Models Use MoE in 2026

MoE has become the dominant architecture for frontier-class open-weight models. The question is no longer “does this model use MoE?” but “how is it configured?”

Model	Total Parameters	Active per Token	Active %	Experts	Architecture Note
DeepSeek V4-Pro	1.6T	49B	~3%	Fine-grained MoE	mHC for training stability
Kimi K2.6	1T	32B	~3%	384 total, 8 active per pass	MuonClip optimizer
GLM-5.1	744B	40B	~5%	DeepSeek Sparse Attention	NVIDIA-independent (Ascend)
DeepSeek V3 / V4-Flash	671B	37B	~5.5%	256 routed experts	MLA attention
Llama 4 Maverick	400B	17B	~4.3%	128 routed experts	Meta’s first MoE flagship
MiniMax M2.7	Large	10B	low	Ultra-sparse	Lowest active-param count
Mixtral 8×22B	141B	39B	~28%	8 experts, 2 active	Earlier-generation Mistral MoE (follows Mixtral 8×7B)
GPT-4 (rumored)	~1.8T	~280B	~16%	16 experts (~111B each), 2 active	OpenAI, unconfirmed
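The Active % column is simply active parameters divided by total parameters. A short sketch that recomputes it from the table's own figures (rows without a published total, and the unconfirmed GPT-4 row, are left out):

```python
# (total, active) in billions of parameters, copied from the table above.
MODELS = {
    "DeepSeek V4-Pro": (1600, 49),
    "Kimi K2.6": (1000, 32),
    "GLM-5.1": (744, 40),
    "DeepSeek V3 / V4-Flash": (671, 37),
    "Llama 4 Maverick": (400, 17),
    "Mixtral 8x22B": (141, 39),
}

def active_percent(total_b: float, active_b: float) -> float:
    return 100.0 * active_b / total_b

for name, (total_b, active_b) in MODELS.items():
    print(f"{name:<24} {active_percent(total_b, active_b):5.1f}% active")
```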

The configuration choices in that table reflect different engineering priorities. Kimi K2.6’s 384-expert design with 8 active per pass is unusually high-granularity. Research has shown that increasing expert count while holding active parameters constant improves expressivity, and K2.6 is built specifically for the coding and agentic tasks where that expressivity matters. GLM-5.1’s 40B active parameters, about 5% of its total compared with roughly 3% for V4-Pro and K2.6, reflects a deliberate choice to retain more general-purpose capability rather than maximizing sparsity. MiniMax M2.7’s 10B active parameters is the opposite extreme: maximum sparsity, optimized for inference speed and cost at the expense of raw capability headroom.

Dense models are not absent from the frontier. Claude Opus 4.7 and Gemini 3.1 Pro are widely believed to use dense or hybrid architectures with proprietary details not disclosed. The performance data shows they remain competitive overall, which suggests that dense architectures with sufficient compute budget can match MoE quality — the difference is what that compute costs to serve at scale.

What Granularity and Expert Count Actually Control

The specific MoE configuration choices are not arbitrary. They determine three things simultaneously: the model’s per-token compute cost, its total memory footprint, and its effective capacity for specialization.

Increasing the number of experts while keeping active experts per token constant raises total parameter count (and therefore memory requirements) without changing per-token compute. You get more capacity for the same inference cost, as long as you can afford the memory. This is why frontier MoE models have been growing total parameter counts aggressively while keeping active parameters roughly stable.

Decreasing the size of individual experts while adding more of them (higher granularity) improves the model’s ability to specialize representations. Recent research has shown that finer-grained MoE architectures achieve exponentially better expressivity than coarser-grained ones at the same active parameter count. DeepSeek’s design philosophy, carried through from V2 to V4, emphasizes fine-grained experts with many more total experts than Western architectures typically use.
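One crude way to get a feel for granularity is to count how many distinct sets of experts a single token could be routed to in one layer. This is only a proxy, not the expressivity measure the research refers to, but it shows how quickly the routing space grows. The two configurations are the expert counts quoted in this article for Mixtral 8×22B and Kimi K2.6.

```python
import math

def expert_combinations(n_experts: int, top_k: int) -> int:
    """Number of distinct expert subsets one token can be routed to in a layer."""
    return math.comb(n_experts, top_k)

coarse = expert_combinations(8, 2)    # 8 experts, 2 active
fine = expert_combinations(384, 8)    # 384 experts, 8 active

print(f"8 experts, top-2:   {coarse:,} possible combinations")
print(f"384 experts, top-8: {fine:,} possible combinations")
```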

The number of active experts per token controls the tradeoff between specialization and generalization. With only two experts active per token, the model must commit to narrow specialization — the routing decision is high-stakes. With eight or sixteen experts active, the model has more flexibility to blend representations from multiple specialists. This matters in practice: models with very low active-expert counts can struggle on tasks that span multiple domains simultaneously, since the routing mechanism cannot easily blend expertise across domains in a single pass.

Why This Architecture Determines API Pricing

The connection between MoE configuration and API pricing is direct. What API providers pay to serve a model is driven largely by GPU compute time per token, which in turn is driven largely by active parameters per token. A model with 37B active parameters needs roughly the same compute per token as a 37B dense model, regardless of how many total parameters it has. The total parameter count determines how much memory the deployment requires, and therefore how many GPUs must be provisioned, but not how much compute each request consumes.

This is what enables the pricing dynamics in the current market. DeepSeek V4-Pro’s price of $3.48/M output tokens is low not in spite of its 1.6 trillion total parameters but because only 49 billion of those parameters activate per token. The per-token compute cost is closer to that of a 50B model than a 1.6T one. At the same time, the full capacity of a 1.6T model is available, because different inputs are routed to different experts.

MiniMax M2.7’s pricing at $1.20/M output — the cheapest in the current lineup of serious models — follows the same logic pushed to an extreme. At 10B active parameters, the per-token compute cost is extremely low, which enables the low price. The capability ceiling is correspondingly lower, but for high-volume use cases where that ceiling is not reached, the economics are compelling.

The models that cannot compete on price — Claude Opus 4.7 at $25/M output, GPT-5.5 at $30/M — are bearing the cost of serving at whatever density their architectures require. Whether that premium is worth it depends on whether the capability delta matters for your specific workload. MoE is not a magic capability multiplier; it is an efficiency multiplier that lets open-weight labs get more capability per dollar of training and serving compute. The resulting capability, for the specific tasks where frontier performance is required, is genuinely different — and the pricing data shows it.
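To see what these list prices mean for a bill, here is a small sketch using the output-token prices quoted above. The monthly volume is a hypothetical workload chosen for illustration, and input-token charges are ignored because this article does not quote them.

```python
# Output-token prices in USD per million tokens, as quoted in this article.
PRICE_PER_M_OUTPUT = {
    "MiniMax M2.7": 1.20,
    "DeepSeek V4-Pro": 3.48,
    "Claude Opus 4.7": 25.00,
    "GPT-5.5": 30.00,
}

def output_cost_usd(price_per_million: float, output_tokens: int) -> float:
    return price_per_million * output_tokens / 1_000_000

MONTHLY_OUTPUT_TOKENS = 500_000_000  # hypothetical workload, not from the article

for model, price in PRICE_PER_M_OUTPUT.items():
    cost = output_cost_usd(price, MONTHLY_OUTPUT_TOKENS)
    print(f"{model:<16} ${cost:>10,.2f} per month")
```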

The Practical Upshot

MoE architecture is the mechanism that explains the most important trend in the current model landscape: frontier-adjacent capability at a fraction of frontier pricing. It is not magic, and it has real costs: memory requirements, routing complexity, and the potential for capability gaps when a task requires blending expertise that a small number of active experts per token cannot easily provide.

What it is: an engineering approach that decouples total model capacity from per-token inference cost, making trillion-parameter models economically viable to train and serve. Every open-weight model that has closed the benchmark gap with GPT-5.5 and Claude Opus 4.7 this year has done so with MoE. The pricing differential those models offer over closed-source alternatives is not a margin decision by their labs — it is a consequence of the architecture that makes their scale possible at all.

For teams running API-based AI workflows, the architecture is not a concern you need to manage directly. But understanding why these models are priced the way they are — and why the combination of large total capacity and low active-parameter cost produces competitive benchmark performance at API pricing that would have been implausible twelve months ago — is useful context for making infrastructure decisions that will compound over time.
