Open-Source vs Closed-Source AI Models: Is the Gap Worth It?


Published on 2026.05.26 by DeepInfra

The top of the Artificial Analysis Intelligence Index currently sits between 57 and roughly 60. Three frontier models (Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.5) all land in that band. Meanwhile, four open-weight models released between February and April 2026 now score 50 or above on the same index. A year ago, the best open-weight models were scoring in the low 30s. The capability story has changed faster than most teams have updated their infrastructure decisions.

This article is about the intelligence gap between open-source and closed-source models in May 2026, what independent benchmark data from Artificial Analysis actually shows, and what the pricing differential means for teams building at volume. We are not covering time-to-first-token or throughput benchmarks. Those matter for specific latency-sensitive applications, but they are not what determines whether a model earns its place in a production pipeline. The question we are answering is: how much capability are you giving up to move from a $25/M output model to a $3/M output model, and on what specific tasks does that gap show up?

The Benchmark That Actually Matters Here

Before getting into model comparisons, it is worth being precise about what the Artificial Analysis Intelligence Index measures and why it is the right lens for this comparison.

The AA Intelligence Index v4.0 is a composite score aggregating ten independent evaluations: GDPval-AA (economically valuable knowledge work in agentic contexts), τ²-Bench Telecom, Terminal-Bench Hard, SciCode, AA-LCR (long-context retrieval), AA-Omniscience, IFBench (instruction following), Humanity’s Last Exam, GPQA Diamond, and CritPt. It is not a vendor self-report — Artificial Analysis runs the evaluations independently. And critically, it is broad enough that it does not reward narrow specialization. A model that is exceptional at code generation but weak at instruction following and scientific reasoning will score lower than its SWE-Bench numbers suggest.

This is important because most vendor launch benchmarks tell a cherry-picked story. A model releases with impressive SWE-Bench numbers because that is the benchmark the lab optimized against, or the one that looks best in a press release. The AA Intelligence Index is harder to game because you would have to optimize across ten different evaluation types simultaneously.

The other benchmark that shows up repeatedly in this comparison is SWE-Bench Verified — the standard evaluation for autonomous software engineering, measuring how often a model can resolve real GitHub issues correctly. For teams building coding agents, it is the most practically relevant single-task benchmark. We will use both throughout.

Where the Models Actually Stand

Here is the current landscape for the models in this comparison, drawing from Artificial Analysis data as of May 2026:

Model	AA Intelligence Index	SWE-Bench Verified	GPQA Diamond	Input $/M	Output $/M	Context	License
GPT-5.5	~60	~82%	93.18%	$5.00	$30.00	1M	Proprietary
Claude Opus 4.7	57	82.0%+	90.15%	$5.00	$25.00	1M	Proprietary
Kimi K2.6	54	80.2%	89.14%	$0.75	$3.50	256K	Modified MIT
GLM-5.1	51	76.4%	84.52%	$1.05	$3.50	200K	MIT
DeepSeek V4-Pro	52	80.6%	90.1%	$1.74	$3.48	1M	MIT
MiniMax M2.7	50	73.80%	86.62%	$0.28	$1.20	197K	MIT
DeepSeek V4-Flash	47	79.0%	—	$0.14	$0.28	1M	MIT

Prices sourced from provider APIs and OpenRouter, May 2026. SWE-Bench scores from Artificial Analysis and vals.ai. AA Intelligence Index from Artificial Analysis leaderboard.

A few things in that table deserve more than a glance.

The first is the gap at the top. GPT-5.5 leads the AA Intelligence Index at around 60, followed by Claude Opus 4.7 at 57. These are genuinely the strongest models available by composite measure, and neither is close to any open-weight model on the full index. The gap between GPT-5.5 and Kimi K2.6 — 60 versus 54 — sounds small in absolute terms, but on a benchmark that aggregates ten demanding evaluations, a 6-point gap is meaningful. It shows up most clearly on pure reasoning (GPQA Diamond: 93.18% for GPT-5.5 versus 89.14% for K2.6) and mathematics.

The second is SWE-Bench specifically. On autonomous coding — the benchmark most relevant for agentic engineering workflows — the gap between the closed-source frontier and the best open-weight models has nearly disappeared. Claude Opus 4.7 leads at 82%, GPT-5.5 is just behind, and Kimi K2.6, DeepSeek V4-Pro, and DeepSeek V4-Flash sit within 3 percentage points of that ceiling. For teams whose primary use case is coding agents, this is the number that matters most, and the open-weight models have earned their place.

The third is MiniMax M2.7 specifically. It is the outlier in this group: the lowest AA Intelligence Index score of the four open-weight models profiled below, but also the lowest output price of those four at $1.20/M. At $0.28/M input and $1.20/M output, only DeepSeek V4-Flash ($0.14/M input, $0.28/M output) prices lower in this comparison, and M2.7 still reaches competitive SWE-Bench scores on straightforward tasks. We will come back to this.

The Closed-Source Case: What You Are Actually Paying For

Claude Opus 4.7 and GPT-5.5 are the two premium closed-source options at $25/M and $30/M output respectively. The honest case for paying that premium comes down to three things.

First, overall intelligence breadth. A 6-point gap on the AA Intelligence Index between GPT-5.5 and the best open-weight model is the result of open-weight models consistently underperforming on reasoning-intensive evaluations outside their training specialization. Kimi K2.6 scores 89.14% on GPQA Diamond versus GPT-5.5’s 93.18%. That gap is small on the benchmark, but GPQA Diamond is already an extremely hard test. At this level, a 4-point gap represents real capability difference on scientific and domain-expert reasoning tasks. For applications that require that reasoning depth, the premium is defensible.

Second, SWE-Bench Pro specifically. SWE-Bench Pro is a harder, more real-world version of the standard benchmark, requiring models to resolve genuine GitHub issues across production-grade, multi-language codebases with standardized scaffolding. Claude Opus 4.7 leads at 64.3%. Kimi K2.6 is the closest open-weight model at 58.6%, tied with GPT-5.5 — but Opus still holds a roughly 6-point edge. For high-stakes codebase work where a wrong patch breaks 40 downstream callers, that margin is meaningful. It is also why teams running complex repository-level reasoning tasks consistently report that Opus earns its cost.

Third, agentic reliability. Independent testing has shown that for complex multi-agent workflows — the kind involving concurrent tool calls, state management across long sessions, and real-time error handling — Claude Opus 4.7 consistently outperforms open-weight alternatives in ways that do not show up cleanly on benchmarks. One head-to-head comparison on a FlowGraph workflow orchestration task found that Opus 4.7 scored 91/100 versus Kimi K2.6’s 68/100, with the gap concentrated in lease handling, cross-run scheduling, and live SSE streaming. Benchmarks measure what labs choose to test. Production reliability shows up in the parts that were not on the test.

None of this means the closed-source premium is always worth it. It means it is worth it for a narrower set of use cases than the sticker price might imply.

The Open-Source Case: Four Models at Different Price-Performance Points

The interesting question is not “can open-weight models beat Claude Opus?” — they cannot, on the full AA Intelligence Index. The interesting question is “how close is good enough, and for which tasks?” The four open-weight models in this comparison represent four distinct answers.

Kimi K2.6: Coding and Long-Horizon Agents

Kimi K2.6, released April 20, 2026 by Moonshot AI, is a 1-trillion-parameter MoE model that scores 54 on the AA Intelligence Index — the highest among open-weight models. Its specialization is long-horizon agentic coding. On SWE-Bench Verified it scores 80.2%, within 2 points of Claude Opus 4.7. On SWE-Bench Pro (58.6%), it ties GPT-5.5 at a fraction of the cost. On Humanity’s Last Exam with tools (54.0%), it leads every model in the comparison including GPT-5.4, which is a striking result for an agentic retrieval task.

The architecture is built for sustained autonomous operation: K2.6 scales to 300 sub-agents and has been documented running 4,000 coordinated tool calls in a single 13-hour session. For teams building coding agents that need to run unattended for hours, this is directly relevant capability.

At $0.75/M input and $3.50/M output, K2.6 is roughly 7x cheaper than Claude Opus 4.7 on output tokens. At 50M output tokens per month, that is $175 versus $1,250. The gap is real. The caveats are also real: K2.6’s 256K context window is a structural ceiling that matters for full-repository ingestion tasks, it ranks poorly on multimodal benchmarks (26th out of 115), and structured-output reliability on complex tool schemas is not yet at parity with Anthropic’s models.

For teams whose primary workload is implementation-heavy coding — writing code, running tests, fixing errors — K2.6 is the most compelling open-weight option available today.

DeepSeek V4-Pro: The 1M Context and Coding Balance

DeepSeek V4-Pro, released April 24, 2026, brings a different value proposition: a 1M token context window at $3.48/M output — 7x cheaper than Claude Opus 4.7. On SWE-Bench Verified it scores 80.6%, within 1.5 points of Opus. On GPQA Diamond it scores 90.1%. On LiveCodeBench it leads all open models at 93.5%.

The architecture story matters here. V4-Pro’s 1M context is built on a genuinely new hybrid attention mechanism — Compressed Sparse Attention combined with Heavily Compressed Attention — that reduces KV cache requirements to 10% of what DeepSeek V3.2 required at that context length. For teams doing full-codebase analysis, long document processing, or any workflow that benefits from ingesting large amounts of context in a single pass, V4-Pro is the only open-weight model that approaches Opus-level coding performance while supporting that context window.

The honest caveat: V4-Pro is verbose. Independent testing showed the model generated 190 million output tokens running the AA Intelligence Index — four times the median for comparable models. At $3.48/M output, that verbosity makes per-task cost higher than the headline rate implies. For production cost modeling, measure output token counts on your actual workload before committing.

GLM-5.1: Agentic Web Development and Code Architecture

GLM-5.1, released April 7, 2026 by Z.AI (formerly Zhipu AI), is a 754-billion-parameter model trained entirely on Huawei Ascend 910B chips — zero NVIDIA GPUs. It scores 51 on the AA Intelligence Index and holds an independently verified Code Arena Elo of 1,530 (third globally on agentic web development). It was the first open-weight model to score 50 on the AA Intelligence Index, a threshold that was not reached by any open-weight model before 2026.

Where GLM-5.1 distinguishes itself from K2.6 is in code architecture quality. Comparative testing on React component generation found that GLM-5.1 spontaneously applied composition patterns and separated concerns in ways that other models did not — a reflection of training that went beyond benchmark optimization. For teams building front-end agents or working on component-level code generation tasks, this is a meaningful practical difference.

At $0.98/M input and $3.08/M output on OpenRouter, GLM-5.1 is priced comparably to DeepSeek V4-Pro while offering a different capability profile. Its 203K context window is the limiting factor for large-codebase ingestion, but for most mid-sized projects it is sufficient.

MiniMax M2.7: Volume Throughput at Near-Commodity Pricing

MiniMax M2.7, released March 18, 2026, is a different kind of option. At $0.28/M input and $1.20/M output, it is the cheapest of the four open-weight models profiled here on output tokens; in the full comparison, only DeepSeek V4-Flash ($0.28/M output) costs less. It runs with only 10 billion active parameters, which means significantly faster inference than the MoE giants above.

On SWE-Bench Pro it scores 56.22%, roughly 2.4 points behind K2.6 and GPT-5.5 (both at 58.6%) and 8 points behind Claude Opus 4.7 (64.3%). On the AA Intelligence Index it scores 50, the lowest of the four open-weight models profiled here. It is not a frontier model. It is a cost-optimization model for high-volume workflows where task complexity is bounded, inference speed matters, and the budget for per-task compute is tight.

For teams running summarization, code formatting, test generation, documentation, or other tasks where near-frontier quality at commodity pricing is the goal, M2.7 is the practical choice. For complex multi-step agentic coding, it is not.

The Pricing Math at Scale

The per-token rate differentials in this comparison are large enough that they compound significantly at production volume. Here is what a realistic agentic coding pipeline looks like across these models at 50M output tokens per month:

Model	Monthly Cost (50M output tokens)	vs. Claude Opus 4.7
GPT-5.5	$1,500	+20% more expensive
Claude Opus 4.7	$1,250	baseline
Kimi K2.6	$175	7.1x cheaper
DeepSeek V4-Pro	$174	7.2x cheaper
GLM-5.1	$154	8.1x cheaper
DeepSeek V4-Flash	$14	89x cheaper
MiniMax M2.7	$60	20.8x cheaper

At 100M output tokens monthly (a reasonable scale for a team with an active AI coding assistant running daily), the annual savings from DeepSeek V4-Pro versus Claude Opus 4.7 come to roughly $25,800: $2,500 versus $348 per month. Routing appropriately between models, using Opus for complex repository tasks and K2.6 or V4-Pro for implementation, captures most of that saving without giving up frontier quality on the requests that need it.
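The arithmetic behind these figures is simple enough to rerun with your own volumes. The following is a minimal sketch, assuming the output prices quoted in this article (GLM-5.1 at its $3.08/M OpenRouter rate) and ignoring input tokens, caching discounts, and verbosity differences. The function names are illustrative, not part of any provider SDK.

```python
# Output prices in USD per million tokens, as quoted in this article (May 2026).
OUTPUT_PRICE_PER_M = {
    "GPT-5.5": 30.00,
    "Claude Opus 4.7": 25.00,
    "Kimi K2.6": 3.50,
    "DeepSeek V4-Pro": 3.48,
    "GLM-5.1": 3.08,  # OpenRouter rate quoted in the GLM-5.1 section
    "MiniMax M2.7": 1.20,
    "DeepSeek V4-Flash": 0.28,
}


def monthly_cost(model: str, output_tokens_m: float) -> float:
    """Monthly output-token spend in USD; volume is given in millions of tokens."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens_m


def annual_savings(cheaper: str, baseline: str, output_tokens_m: float) -> float:
    """Yearly saving from moving the whole monthly volume from baseline to cheaper."""
    monthly_delta = monthly_cost(baseline, output_tokens_m) - monthly_cost(cheaper, output_tokens_m)
    return 12 * monthly_delta


# Print monthly cost and the Opus-to-model cost ratio at 50M output tokens.
volume_m = 50
opus_cost = monthly_cost("Claude Opus 4.7", volume_m)
for name in OUTPUT_PRICE_PER_M:
    cost = monthly_cost(name, volume_m)
    print(f"{name:<18} ${cost:>8,.0f}  Opus/model cost ratio: {opus_cost / cost:5.1f}")
```

Changing `volume_m`, or calling `annual_savings("DeepSeek V4-Pro", "Claude Opus 4.7", 100)`, shows how quickly the differential grows with volume.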

This is the real implication of the current model landscape: routing is the leverage point. No single model wins on every dimension. GPT-5.5 and Opus 4.7 lead on overall reasoning breadth and complex agentic reliability. K2.6 and V4-Pro are within 2-3 percentage points on coding benchmarks at 6-7x lower output cost. MiniMax M2.7 handles volume tasks at near-commodity pricing. Building a pipeline that routes by task type rather than running everything through a single flagship model is where the economics improve most dramatically.
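To put numbers on the routing argument, here is a rough sketch of a blended monthly bill. It assumes a hypothetical 20/80 split of output tokens between Claude Opus 4.7 and DeepSeek V4-Pro at the output prices quoted above. The split is an illustration rather than a measured workload, and `blended_monthly_cost` is an illustrative helper.

```python
def blended_monthly_cost(
    volume_m: float,
    shares: dict[str, float],
    price_per_m: dict[str, float],
) -> float:
    """Monthly output spend in USD when volume (millions of tokens) is split by share."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(volume_m * share * price_per_m[model] for model, share in shares.items())


# Output prices in USD per million tokens, as quoted in this article.
prices = {"Claude Opus 4.7": 25.00, "DeepSeek V4-Pro": 3.48}

# Hypothetical split: 20% of output tokens need the frontier model.
all_opus = blended_monthly_cost(50, {"Claude Opus 4.7": 1.0}, prices)
routed = blended_monthly_cost(50, {"Claude Opus 4.7": 0.2, "DeepSeek V4-Pro": 0.8}, prices)

print(f"all Opus: ${all_opus:,.2f}  routed 20/80: ${routed:,.2f}")
```

Under those assumptions the routed pipeline costs about $389 a month against $1,250 for Opus alone, a reduction of roughly 69%, while the hardest fifth of requests still goes to the frontier model. The right split for a real pipeline has to come from your own task mix.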

What the Data Does Not Tell You

Three caveats that belong in any honest benchmark comparison.

Hallucination rates vary significantly. The AA-Omniscience benchmark, which measures whether models acknowledge when they do not know an answer rather than confabulating, shows meaningful differences across models. V4-Pro has a high hallucination rate on this benchmark — when it does not know, it tends to respond anyway. This is not unique to DeepSeek, and it is important for factual retrieval applications.

Benchmark harnesses affect scores. Terminal-Bench 2.0 results in particular vary substantially depending on which agent framework runs the evaluation. Moonshot’s published K2.6 Terminal-Bench score of 66.7% uses the Terminus-2 harness; other evaluations of GPT-5.4 with different harness configurations report results up to 10 points higher than the Terminus-2 number. Do not use Terminal-Bench comparisons across labs as clean conclusions without verifying harness consistency.

Verbosity affects total task cost. DeepSeek V4’s models in particular generate significantly more output tokens per task than comparable models. At $3.48/M for V4-Pro, four times the median token output means the effective per-task cost is not 7x below Opus 4.7 — it is closer to 1.75x below. Still a meaningful saving, but not the headline number. Measure on your actual workload.
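The verbosity adjustment is a one-line calculation. This sketch assumes the verbose model emits a fixed multiple of the baseline model's output tokens on the same task, and that the baseline sits at the median. The 4x multiplier comes from the index run described above and may not match your workload; `effective_cost_ratio` is an illustrative name.

```python
def effective_cost_ratio(
    baseline_price: float,
    model_price: float,
    verbosity_multiplier: float = 1.0,
) -> float:
    """Return how many times cheaper a model is per task than the baseline.

    Prices are USD per million output tokens. verbosity_multiplier is the
    model's output tokens per task divided by the baseline's.
    """
    if verbosity_multiplier <= 0:
        raise ValueError("verbosity_multiplier must be positive")
    return baseline_price / (model_price * verbosity_multiplier)


# Claude Opus 4.7 at $25/M versus DeepSeek V4-Pro at $3.48/M.
headline = effective_cost_ratio(25.00, 3.48)        # per-token comparison
adjusted = effective_cost_ratio(25.00, 3.48, 4.0)   # per-task, at 4x output tokens

print(f"headline: {headline:.2f}x cheaper, verbosity-adjusted: {adjusted:.2f}x cheaper")
```

With the exact prices the adjusted figure comes out at about 1.8x, in line with the rounded 1.75x above. Replacing the 4.0 with a multiplier measured on your own tasks gives the number that belongs in a budget.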

The Decision Framework

The choice between these models is not “which one is best?” It is “which one is best for this task, at this cost point?”

For complex multi-file repository reasoning, long-context document analysis, and workflows where per-task reliability is worth a premium: Claude Opus 4.7. Nothing in the open-weight field matches its SWE-Bench Pro score or its agentic reliability on complex concurrent workflows.

For implementation-heavy coding agents, long-horizon autonomous runs, and teams where the 256K context ceiling is not a blocking issue: Kimi K2.6. Near-Opus SWE-Bench Verified performance at roughly 7x lower output cost, with architecture specifically built for sustained multi-step operation.

For workloads that need 1M context at near-frontier coding quality, or teams that prioritize long-context ingestion over pure coding benchmark score: DeepSeek V4-Pro. The only open-weight model combining 1M context with competitive SWE-Bench performance. Account for verbosity in cost modeling.

For front-end development, component architecture, and agentic web-development tasks where code quality patterns matter: GLM-5.1. Independent Code Arena data puts it third globally on agentic web development. It is not a top-tier reasoning model, but for its specialty it is the open-weight leader.

For high-volume pipelines where task complexity is bounded (summarization, test generation, documentation, classification): MiniMax M2.7. Lowest output cost of the four open-weight models profiled above, active-parameter efficiency that translates to fast inference, and sufficient coding capability for non-frontier tasks.
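The framework above maps naturally onto a lookup table. The following is a minimal sketch, assuming a hypothetical set of task-type labels that your pipeline assigns upstream. The labels, the `pick_model` helper, the approximate context limits, and the choice of Opus as the fallback are all illustrative assumptions, not recommendations drawn from the benchmarks.

```python
# Hypothetical task labels mapped to the model suggested for each case above.
ROUTES = {
    "repo_reasoning": "Claude Opus 4.7",
    "implementation": "Kimi K2.6",
    "long_context": "DeepSeek V4-Pro",
    "frontend": "GLM-5.1",
    "bulk": "MiniMax M2.7",
}

# Approximate context ceilings in tokens, taken from the figures in this article.
CONTEXT_LIMIT_TOKENS = {
    "Kimi K2.6": 256_000,
    "GLM-5.1": 200_000,
    "MiniMax M2.7": 197_000,
}

FALLBACK_MODEL = "Claude Opus 4.7"      # assumption: unknown tasks go to the strongest model
LONG_CONTEXT_MODEL = "DeepSeek V4-Pro"  # 1M-token context window


def pick_model(task_type: str, context_tokens: int = 0) -> str:
    """Choose a model by task type, escalating when the prompt exceeds its context limit."""
    model = ROUTES.get(task_type, FALLBACK_MODEL)
    limit = CONTEXT_LIMIT_TOKENS.get(model)
    if limit is not None and context_tokens > limit:
        return LONG_CONTEXT_MODEL
    return model


print(pick_model("implementation"))                          # short prompt stays on K2.6
print(pick_model("implementation", context_tokens=500_000))  # too long for K2.6
```

In practice the hard part is assigning `task_type` reliably, which this sketch leaves to the caller.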

Running These Models on DeepInfra

DeepSeek V4-Pro, DeepSeek V4-Flash, and the other open-weight models in this comparison are available on DeepInfra. The API is OpenAI-compatible: if you are already using the OpenAI SDK, switching requires changing the base URL and model name.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPINFRA_API_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the tradeoffs between MoE and dense transformer architectures for long-context inference."},
    ],
    temperature=0.7,
)

print(response.choices[0].message.content)

Swap deepseek-ai/DeepSeek-V4-Pro for any other available model to test on your actual workload. The most informative thing you can do before making a routing decision is run the same representative task through two or three candidates and compare output quality versus token count. Benchmark scores are a starting point. Your workload is the real test.
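That side-by-side run can be scripted in a few lines. This sketch assumes `client` is an OpenAI SDK client configured as in the example above, and that the model IDs you pass exist on your provider. Only `deepseek-ai/DeepSeek-V4-Pro` appears in this article, so any other ID is a placeholder you would need to look up. The `compare_models` helper is illustrative.

```python
from typing import Any


def compare_models(client: Any, models: list[str], prompt: str) -> dict[str, dict[str, Any]]:
    """Send the same prompt to each model and collect the reply and output token count."""
    results: dict[str, dict[str, Any]] = {}
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        results[model] = {
            "text": response.choices[0].message.content,
            # Output token count reported by the API for a non-streaming request.
            "output_tokens": response.usage.completion_tokens,
        }
    return results


def print_token_counts(results: dict[str, dict[str, Any]]) -> None:
    """Print models ordered from fewest to most output tokens."""
    for model, info in sorted(results.items(), key=lambda item: item[1]["output_tokens"]):
        print(f"{model}: {info['output_tokens']} output tokens")
```

Running it with two or three model IDs and a representative prompt gives you reply text to judge by hand, plus the token counts that drive per-task cost.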

The Bottom Line

The open-source model field has done something in the first four months of 2026 that would have seemed implausible at the end of 2025: it closed the coding benchmark gap with frontier closed-source models to within 2-3 percentage points, while maintaining a 6-7x price advantage on output tokens. Kimi K2.6 ties GPT-5.5 on SWE-Bench Pro. DeepSeek V4-Pro leads all open models on LiveCodeBench. GLM-5.1 holds the third-highest Code Arena Elo globally for agentic web development.

The gap that remains is real. On the AA Intelligence Index, GPT-5.5 at 60 and Claude Opus 4.7 at 57 are meaningfully ahead of any open-weight model. On SWE-Bench Pro, Opus 4.7’s 64.3% against K2.6’s 58.6% reflects genuine capability on the hardest real-world coding tasks. For applications where that margin matters (high-stakes code refactoring, complex multi-agent orchestration, frontier reasoning tasks), the closed-source premium is defensible.

For the majority of production workloads, the decision is more interesting than that. The right architecture in 2026 routes by task: closed-source for the fraction of requests where Opus-level capability is actually necessary, open-weight models for the rest. Getting that routing right is where teams building at scale will find the most leverage — better outcomes and lower bills, not just one or the other.
