How Open Source AI Is Closing the Gap

Published on 2026.07.01 by DeepInfra

At the end of 2023, the gap between open-weight and closed-source AI models was real and easy to describe. If you wanted the best performance on reasoning, language understanding, or multi-step problem solving, you paid for a proprietary API. Open models were useful, capable for many tasks, and dramatically cheaper to run, but they were not considered production-grade alternatives to GPT-4 for anything that required frontier intelligence.

That assessment is no longer accurate. The convergence that practitioners had been observing informally became quantifiable by early 2026: the Stanford AI Index 2025 Report documented that the 17.5 percentage point gap between the best US and Chinese models on MMLU had effectively reached zero. On math benchmarks including MATH-500 and AIME, open models now lead the field outright. On graduate-level science reasoning (GPQA Diamond), they are competitive with all but the most expensive frontier options.

This article covers how that happened, which domains still have a meaningful closed-model advantage, and what it means for teams deciding where to route their workloads.

The Benchmark Timeline

The trajectory accelerated in phases rather than linearly.

2023: Meta’s LLaMA release in early 2023 seeded the fine-tuning era. The weights were leaked before an official open release, but the effect was the same: a capable base model that thousands of researchers and developers could iterate on. By the end of 2023, Mistral had shipped a competitive 7B model that punched significantly above its size, and the open-weight ecosystem had its first real infrastructure story — models that could run on a single consumer GPU while remaining genuinely useful.
2024: The MoE architecture shift changed the economics of open models. DeepSeek V2 and Mixtral demonstrated that sparse expert models could activate only a fraction of their parameters per token, delivering performance comparable to much larger dense models at a fraction of the inference cost. Qwen 2 emerged as the first Chinese open-weight flagship with broad international adoption, and Qwen 2.5 — released in September 2024 — marked the moment when Chinese model downloads on Hugging Face began to overtake US models. Meta’s Llama 3.1 series, including the 405B variant, matched GPT-4 on many benchmarks while being freely downloadable.
January 2025: DeepSeek R1. A reasoning model under an MIT license that matched OpenAI’s o1 on most benchmarks, trained for a reported cost of around $6 million. The release did two things simultaneously: it demonstrated that frontier reasoning capability was achievable by a small team without access to US export-restricted compute, and it inspired a wave of Chinese labs to commit to day-zero open releases of their best models. The open-weight ecosystem went from being incrementally competitive to structurally reshaping the frontier.
2025–2026: Qwen 3 became the most downloaded open model family globally, passing Meta’s Llama in total downloads by October 2025 and crossing one billion cumulative Hugging Face downloads in January 2026. In February 2026, Qwen alone accounted for 153.6 million downloads in a single month — more than double the combined total of the next eight providers. DeepSeek V3.2 achieved gold-medal performance at the 2025 International Mathematical Olympiad, IOI, and ICPC World Finals. Kimi K2 shipped as a trillion-parameter MoE model under a permissive license. GLM-5 from Zhipu, released in mid-2026, arrived with a one-million-token context window and early benchmark results competitive with leading closed models on math reasoning.
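
The sparse-activation idea in the 2024 entry above can be illustrated with a toy example. This is a minimal sketch of top-k expert routing, not the routing code of DeepSeek V2 or Mixtral: the function names, the eight toy experts, and the gate scores are all hypothetical, and real models route inside each transformer layer with learned gates.

```python
import math

def top_k_route(gate_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    peak = max(gate_scores[i] for i in top)
    exps = [math.exp(gate_scores[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_scores, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_scores, k))

# Hypothetical setup: 8 toy experts, each a simple scaling function.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]  # made-up router outputs

selected = top_k_route(gate_scores, k=2)
active_fraction = len(selected) / len(experts)  # share of experts that run for this token
output = moe_forward(3.0, experts, gate_scores, k=2)
```

Only the selected experts are evaluated for a given token, which is why the compute per token tracks the active parameter count rather than the total.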

Where the Gap Has Closed

The convergence is most complete on knowledge and reasoning benchmarks that dominated the 2023 and 2024 AI evaluation landscape.

General knowledge (MMLU, MMLU-Pro): The 17.5-point gap that existed at end-2023 is now effectively zero on MMLU. On MMLU-Pro, open models including DeepSeek V4 Pro (73.5) and Qwen3 235B sit within a few points of closed frontier options.
Mathematics: Open models now lead this category. DeepSeek R1 scores 97.3% on MATH-500, which is the highest of any model, open or closed. DeepSeek V3.2-Speciale won gold at the International Mathematical Olympiad in 2025, the first time an AI system achieved that result. Qwen3 235B scores 85.7% on AIME 2024.
Graduate-level science reasoning (GPQA Diamond): Qwen 3.5 scores 88.4 on GPQA Diamond, competitive with all but the most expensive frontier closed options. Open models have gone from being non-competitive on this benchmark to sitting near the top of it within roughly 18 months.
Long-context retrieval: The context window story has largely resolved in open source’s favor. Kimi K2 ships with a 256K context window under a modified MIT license. DeepSeek V4 Pro and Flash support one million tokens. GLM-5 offers one million tokens. The era when closed models had exclusive access to long-context capability is over.
Multilinguality: Qwen’s dominance in global downloads reflects genuine multilingual capability rather than just model scale. The family supports more than 100 languages. In February 2026, Qwen’s download dominance was particularly pronounced in non-English markets where alternative open options were weaker.

Where Closed Models Still Lead

The remaining advantage for closed models is concentrated in a specific set of tasks. It is worth naming them precisely rather than gesturing at a general frontier gap that no longer describes most workloads.

Complex agentic tasks and human preference: Chatbot Arena, which measures real-world human preference across open-ended conversations, still shows closed models — particularly GPT-5 and Claude 4 — at the top of the overall rankings. For tasks requiring nuanced instruction-following across a long interaction, tonal precision, or high-stakes synthesis, closed frontier models maintain a meaningful lead. This matters for consumer-facing applications where the user experience is the product.
Safety and calibration: Closed models, particularly those from Anthropic and OpenAI, have invested more heavily in safety fine-tuning, constitutional training, and output calibration. For applications where hallucination rate, refusal quality, and output consistency under adversarial inputs matter, this remains a real differentiator. DeepSeek V4 Pro, for example, has a documented 94% hallucination rate on the AA-Omniscience benchmark, meaning it nearly always produces an answer even when it does not know one. That is a specific, known limitation worth accounting for in production use cases where confidence calibration matters.
Latest multimodal capability: Gemini 2.5 Pro’s multimodal reasoning, particularly on documents, charts, and long visual contexts, remains a clear differentiator. Open-weight multimodal models are competitive at image understanding but generally do not match the best closed models on complex visual reasoning or long multimodal contexts as of mid-2026.
Novel capability release timing: Closed models still tend to ship new capabilities first, with open models following within roughly one to three months. The lag has compressed dramatically from the 12-plus month gap that characterized 2023, but it remains real at the cutting edge.
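
To make the hallucination-rate figure in the safety and calibration item concrete, here is a small sketch of how such a rate can be computed. It assumes the rate is the share of non-correct responses in which the model gave a wrong answer instead of abstaining, which matches the article's description; check the benchmark's own documentation for its exact definition. The function name and the counts are hypothetical.

```python
def hallucination_rate(incorrect: int, abstained: int, partial: int = 0) -> float:
    """Share of non-correct responses that were wrong answers rather than abstentions.

    Assumed definition: incorrect / (incorrect + partial + abstained).
    Correct answers are excluded, so this measures behaviour only on
    questions the model did not get right.
    """
    non_correct = incorrect + partial + abstained
    if non_correct == 0:
        return 0.0  # nothing was missed, so there was nothing to hallucinate on
    return incorrect / non_correct

# Hypothetical counts: of 50 questions the model did not answer correctly,
# it guessed wrongly on 47 and declined to answer 3.
rate = hallucination_rate(incorrect=47, abstained=3)
```

Under this definition a high rate says little about overall accuracy; it says the model rarely declines when it should.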

The Ecosystem Effect

Benchmark convergence understates the actual competitive shift because it does not capture ecosystem dynamics.

Qwen’s 113,000 derivative models on Hugging Face mean the Qwen base has been fine-tuned for more specific use cases than any other model family. That kind of derivative ecosystem compounds in ways that benchmark scores cannot capture: domain-specific fine-tunes, quantization work, deployment tooling, and community documentation all accumulate on top of a popular base. Alibaba has more derivative models than Google and Meta combined on Hugging Face. That is a structural moat, not just a quality signal.

The same effect applies to infrastructure. Open-weight models can be served by any inference provider, which drives price competition down and availability up. DeepSeek V3 and its variants are available on dozens of providers simultaneously. Closed models are available only from their originating labs or authorized resellers, and pricing reflects that monopoly on supply. For high-volume workloads, the cost differential between open and closed models ranges from 4x to 30x depending on the specific models compared, and that gap is structural rather than temporary.

There is also a geographic dimension. Chinese models now dominate open-source download rankings globally. The shift from US-dominant to China-dominant downloads happened in the summer of 2025, according to the ATOM Project’s tracking of Hugging Face data. Whether that represents a long-term reorientation of the ecosystem or a temporary advantage from a wave of competitive releases remains to be seen. But as of mid-2026, the open-source frontier is defined primarily by labs in China, not Silicon Valley.

What This Means in Practice

For most production workloads (document analysis, structured output generation, RAG pipelines, multilingual processing, summarization, and classification), the decision between open and closed is no longer a quality decision. It is a cost and control decision. Open models are the economically rational default for these use cases, and the quality difference on specific tasks needs to be measured rather than assumed.

For workloads at the edge of what models can do, closed frontier models maintain a real advantage. That advantage is narrowing with each release cycle, and the lag time between a capability appearing in closed models and a competitive open alternative is measured in months rather than years.

The practical takeaway is that the default assumption should now run in the opposite direction from 2023. The right question is no longer “why would I use an open model?” but “why do I specifically need a closed one?” For a growing share of real workloads, there is no good answer to the second question.
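
One way to encode that reversed default is a small routing table. The sketch below is illustrative only: the task labels are hypothetical names for the workload types named in this article, and a real router would be driven by your own evaluation results instead of a fixed list.

```python
# Hypothetical labels for the workloads the article treats as open-model defaults.
OPEN_DEFAULT = frozenset({
    "document_analysis", "structured_output", "rag",
    "multilingual", "summarization", "classification",
})

# Hypothetical labels for the areas where the article says closed models still lead.
CLOSED_LEAD = frozenset({
    "consumer_chat", "complex_agentic",
    "calibration_critical", "complex_visual_reasoning",
})

def default_route(task: str) -> str:
    """Return a starting recommendation for a task label.

    Open is the default; closed is chosen only when there is a specific,
    named reason. Unknown tasks start on open but are flagged for measurement.
    """
    if task in CLOSED_LEAD:
        return "closed"
    if task in OPEN_DEFAULT:
        return "open"
    return "open, pending evaluation"

routes = {t: default_route(t) for t in ("rag", "consumer_chat", "code_review")}
```

The ordering of the checks carries the argument: a workload has to justify a closed model, not the other way around.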

Open Source Models on DeepInfra

DeepInfra serves the full range of open-weight frontier models discussed here, including DeepSeek V4 Pro and Flash, Kimi K2, Qwen3, GLM-5, Llama 4, and Gemma 4, among others, on H100-backed infrastructure with low and predictable time to first token (TTFT) and usage-based pricing with no contracts. DeepSeek V3.2 starts at $0.26 per million input tokens. Kimi K2 costs $0.40 per million input tokens and $2.00 per million output tokens. For the broad middle of production AI workloads, that is the math that matters.
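
As a rough illustration of that math, the sketch below turns per-million-token prices into a monthly bill. The Kimi K2 prices are the ones quoted above; the token volumes are a hypothetical workload, and the function name is illustrative.

```python
def monthly_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Usage-based cost: tokens are billed per million, input and output separately."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return round(input_cost + output_cost, 2)

# Kimi K2 prices as quoted in this article (USD per million tokens).
KIMI_K2_INPUT = 0.40
KIMI_K2_OUTPUT = 2.00

# Hypothetical workload: 500M input tokens and 100M output tokens per month.
bill = monthly_cost_usd(500_000_000, 100_000_000, KIMI_K2_INPUT, KIMI_K2_OUTPUT)
```

Running the same function with another model's prices gives the cost ratio for your own input and output mix, which matters because output tokens are priced higher than input tokens.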

Explore all available models: deepinfra.com/models
