Kimi K2.6 is Now Available on DeepInfra
Published on 2026.04.30 by DeepInfra

Kimi K2.6 can coordinate up to 300 sub-agents executing 4,000 steps in a single autonomous run — Moonshot AI’s answer to the gap between what frontier models can do in a chat window and what production agentic systems actually need. Built for long-horizon coding, deep research, and complex orchestration, the model is open source under a Modified MIT license with weights publicly available on Hugging Face. On SWE-Bench Pro, it scores 58.6, outperforming GPT-5.4 (57.7) and Claude Opus 4.6 (53.4).

Under the hood, Kimi K2.6 is a Mixture-of-Experts model with 1 trillion total parameters and only 32 billion activated per token — a design that keeps inference tractable while preserving model capacity. A 256K-token context window and native multimodal support via a 400M-parameter vision encoder round out the architecture for cross-modal, long-context tasks. On DeepSearchQA, it scores 92.5 (F1-score) versus GPT-5.4’s 78.6 — a 14-point gap that reflects real capability on research-heavy agentic workflows. You can try it now on the Kimi K2.6 model page on DeepInfra, or browse the full text generation model catalog if you’re evaluating alternatives.

What Makes This Model Different

Kimi K2.6 is a Mixture-of-Experts (MoE) model with 1 trillion total parameters, but only 32B are activated per token — keeping inference costs in check at scale. The architecture runs 61 layers (including one dense layer), 384 experts with 8 selected per token, and uses Multi-head Latent Attention (MLA) with a 7168 attention hidden dimension. Vision input is handled by a native MoonViT encoder (400M parameters), making multimodality a first-class concern rather than a bolt-on. The model ships in native INT4 (fp4 on DeepInfra) and supports a 256K token context window.

The benchmark story is strongest in coding and agentic tasks. A few numbers worth highlighting against GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro:

Benchmark              K2.6    GPT-5.4    Claude Opus 4.6    Gemini 3.1 Pro
SWE-Bench Pro          58.6    57.7       53.4               54.2
LiveCodeBench v6       89.6    88.8       91.7               n/a
Terminal-Bench 2.0     66.7    65.4       65.4               68.5
DeepSearchQA (F1)      92.5    78.6       91.3               n/a
HLE-Full (w/ tools)    54.0    52.1       53.0               51.4
OSWorld-Verified       73.1    75.0       72.7               n/a

On reasoning benchmarks (HLE-Full, GPQA-Diamond), K2.6 scores competitively but trails the frontier slightly — the model is clearly optimized for execution rather than pure reasoning.

The agent swarm capability is architecturally significant. The model can coordinate up to 300 sub-agents across 4,000 steps in a single run, dynamically decomposing tasks into parallel, domain-specialized subtasks and returning structured end-to-end outputs — documents, spreadsheets, websites. This isn’t just a prompted behavior; it’s a design target reflected in benchmarks like BrowseComp (83.2, jumping to 86.3 in swarm mode) and WideSearch (80.8 item-f1). If you’re building document-heavy pipelines, the OCR-powered PDF reader and summarizer tutorial using Kimi K2 on DeepInfra is a practical starting point for seeing those capabilities in action.

Compared to Kimi K2.5, the gains are meaningful across the board:

  • SWE-Bench Pro: 50.7 → 58.6
  • OSWorld-Verified: 63.3 → 73.1
  • DeepSearchQA F1: 77.1 → 83.0
  • LiveCodeBench v6: 85.0 → 89.6
  • Toolathlon: 27.8 → 50.0

The Toolathlon jump in particular (27.8 → 50.0) suggests substantially improved tool-use reliability — which matters if you’re building agents that actually invoke external systems.

The model’s weights and code are publicly available under a Modified MIT license on Hugging Face and GitHub. On DeepInfra it’s served as moonshotai/Kimi-K2.6 with JSON and function calling support out of the box, and private endpoint deployment available through the dashboard.

Getting Started on DeepInfra

Kimi K2.6 is available on DeepInfra under the identifier moonshotai/Kimi-K2.6 as a public endpoint. Pricing is straightforward: $0.75 per 1M input tokens, $3.50 per 1M output tokens, and $0.15 per 1M cached tokens. If you want to understand how those numbers translate to real workload costs — including how prompt caching affects your bill — the token math and cost-per-completion guide is worth reading before you scale up. The model runs with fp4 quantization (native INT4) and supports a 262,144-token context window, along with JSON output and function calling out of the box.
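To see how those rates translate into a bill, here is a quick back-of-the-envelope estimate in Python. Only the per-million-token rates come from the pricing above; the traffic figures (requests per day, prompt and completion sizes, cache hit rate) are illustrative assumptions, not DeepInfra numbers.

# Rough cost estimate for Kimi K2.6 on DeepInfra.
# The three rates below are the published prices; everything else
# (traffic volume, token counts, cache hit rate) is an assumed example workload.
INPUT_PER_M = 0.75    # USD per 1M input tokens
OUTPUT_PER_M = 3.50   # USD per 1M output tokens
CACHED_PER_M = 0.15   # USD per 1M cached input tokens

requests_per_day = 10_000   # assumption
input_tokens = 6_000        # assumption: prompt size per request
output_tokens = 1_200       # assumption: completion size per request
cache_hit_rate = 0.5        # assumption: share of prompt tokens served from cache

fresh_input = input_tokens * (1 - cache_hit_rate)
cached_input = input_tokens * cache_hit_rate

cost_per_request = (
    fresh_input / 1e6 * INPUT_PER_M
    + cached_input / 1e6 * CACHED_PER_M
    + output_tokens / 1e6 * OUTPUT_PER_M
)
print(f"per request: ${cost_per_request:.4f}")                      # ~$0.0069
print(f"per day:     ${cost_per_request * requests_per_day:.2f}")   # ~$69.00

With a 50% cache hit rate the input side of the bill drops noticeably, which is why the prompt-caching section of the cost guide linked above is worth the read before you scale up.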

DeepInfra gives you API access with zero infrastructure setup — just an API key and the standard OpenAI-compatible interface. You pay for what you use, nothing more. DeepInfra operates under a zero-retention policy and is SOC 2 and ISO 27001 certified.

To make your first call, grab your API key from the DeepInfra dashboard and drop it into any of the examples below (cURL, Python, or Node.js):

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "moonshotai/Kimi-K2.6",
      "messages": [
        {
          "role": "user",
          "content": "Write a Python script that monitors a directory for file changes and logs them."
        }
      ]
    }'

from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)


response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[
        {
            "role": "user",
            "content": "Write a Python script that monitors a directory for file changes and logs them."
        }
    ],
)
print(response.choices[0].message.content)

import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: "$DEEPINFRA_TOKEN",
  baseURL: "https://api.deepinfra.com/v1/openai",
});


const response = await openai.chat.completions.create({
  model: "moonshotai/Kimi-K2.6",
  messages: [
    {
      role: "user",
      content: "Write a Python script that monitors a directory for file changes and logs them.",
    },
  ],
});
console.log(response.choices[0].message.content);

The only things that differ from a standard OpenAI call are the base URL (https://api.deepinfra.com/v1/openai), your DeepInfra token, and the model name. If you already use the OpenAI Python or Node.js SDK, nothing else changes.
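Function calling works through the same interface. The sketch below passes a tools array the way the standard OpenAI SDK expects; the get_weather tool and its schema are hypothetical placeholders for illustration, not part of DeepInfra's or Moonshot's API.

from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Hypothetical tool definition: the name and schema are illustrative only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)

# If the model chooses to call the tool, the arguments arrive as structured JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)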

If you’re working on inference optimization and want to understand the tradeoffs between precision formats and quantization before deploying at scale, the practical guide to quantization and LLM inference costs covers how choices like fp4 and INT4 affect latency and throughput in practice. For a direct comparison between K2.6 and Kimi K2 Instruct, both are available on DeepInfra — or check the September 0905 update if you have existing workloads pinned to that version.
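One more practical note before wrapping up: the JSON output support mentioned earlier is typically exposed on OpenAI-compatible endpoints through the response_format parameter, and the sketch below assumes DeepInfra follows that convention for this model. Describing the desired structure in the prompt is just one way to steer the output; adapt it to your own schema.

from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Assumes the endpoint honors the standard response_format parameter for JSON output.
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[
        {
            "role": "user",
            "content": "List three Python file-watching libraries as a JSON object "
                       "with a 'libraries' array of {name, description} entries.",
        }
    ],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)  # a JSON string you can json.loads()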

Conclusion

Kimi K2.6 is a credible option for teams building production agentic systems — not because of any single benchmark number, but because its architecture was designed around execution at scale from the ground up, and the numbers back that up across coding, research, and tool-use tasks. The jump from K2.5 to K2.6 in areas like Toolathlon and OSWorld reflects a model that’s maturing in the right direction. Developers working on autonomous coding pipelines, deep research agents, or multi-step orchestration workflows now have a capable, open-weight model they can deploy without vendor lock-in. The full spec, pricing details, and a browser-based API playground are all on the Kimi K2.6 model page on DeepInfra.
