Kimi K2.6 is Now Available on DeepInfra
Published on 2026.04.30 by DeepInfra

Kimi K2.6 can coordinate up to 300 sub-agents executing 4,000 steps in a single autonomous run — Moonshot AI’s answer to the gap between what frontier models can do in a chat window and what production agentic systems actually need. Built for long-horizon coding, deep research, and complex orchestration, the model is open source under a Modified MIT license with weights publicly available on Hugging Face. On SWE-Bench Pro, it scores 58.6, outperforming GPT-5.4 (57.7) and Claude Opus 4.6 (53.4).

Under the hood, Kimi K2.6 is a Mixture-of-Experts model with 1 trillion total parameters and only 32 billion activated per token — a design that keeps inference tractable while preserving model capacity. A 256K-token context window and native multimodal support via a 400M-parameter vision encoder round out the architecture for cross-modal, long-context tasks. On DeepSearchQA, it scores 92.5 (F1-score) versus GPT-5.4’s 78.6 — a 14-point gap that reflects real capability on research-heavy agentic workflows. You can try it now on the Kimi K2.6 model page on DeepInfra, or browse the full text generation model catalog if you’re evaluating alternatives.

What Makes This Model Different

Kimi K2.6 is a Mixture-of-Experts (MoE) model with 1 trillion total parameters, but only 32B are activated per token — keeping inference costs in check at scale. The architecture runs 61 layers (including one dense layer), 384 experts with 8 selected per token, and uses Multi-head Latent Attention (MLA) with a 7168 attention hidden dimension. Vision input is handled by a native MoonViT encoder (400M parameters), making multimodality a first-class concern rather than a bolt-on. The model ships in native INT4 (fp4 on DeepInfra) and supports a 256K token context window.

The benchmark story is strongest in coding and agentic tasks. A few numbers worth highlighting against GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro:

Benchmark              K2.6    GPT-5.4    Claude Opus 4.6    Gemini 3.1 Pro
SWE-Bench Pro          58.6    57.7       53.4               54.2
LiveCodeBench v6       89.6    88.8       91.7               n/a
Terminal-Bench 2.0     66.7    65.4       65.4               68.5
DeepSearchQA (F1)      92.5    78.6       91.3               n/a
HLE-Full (w/ tools)    54.0    52.1       53.0               51.4
OSWorld-Verified       73.1    75.0       72.7               n/a

On reasoning benchmarks (HLE-Full, GPQA-Diamond), K2.6 scores competitively but trails the frontier slightly — the model is clearly optimized for execution rather than pure reasoning.

The agent swarm capability is architecturally significant. The model can coordinate up to 300 sub-agents across 4,000 steps in a single run, dynamically decomposing tasks into parallel, domain-specialized subtasks and returning structured end-to-end outputs — documents, spreadsheets, websites. This isn’t just a prompted behavior; it’s a design target reflected in benchmarks like BrowseComp (83.2, jumping to 86.3 in swarm mode) and WideSearch (80.8 item-f1). If you’re building document-heavy pipelines, the OCR-powered PDF reader and summarizer tutorial using Kimi K2 on DeepInfra is a practical starting point for seeing those capabilities in action.

Compared to Kimi K2.5, the gains are meaningful across the board:

  • SWE-Bench Pro: 50.7 → 58.6
  • OSWorld-Verified: 63.3 → 73.1
  • DeepSearchQA F1: 77.1 → 83.0
  • LiveCodeBench v6: 85.0 → 89.6
  • Toolathlon: 27.8 → 50.0

The Toolathlon jump in particular (27.8 → 50.0) suggests substantially improved tool-use reliability — which matters if you’re building agents that actually invoke external systems.

The model’s weights and code are publicly available under a Modified MIT license on Hugging Face and GitHub. On DeepInfra it’s served as moonshotai/Kimi-K2.6 with JSON and function calling support out of the box, and private endpoint deployment available through the dashboard.

Getting Started on DeepInfra

Kimi K2.6 is available on DeepInfra under the identifier moonshotai/Kimi-K2.6 as a public endpoint. Pricing is straightforward: $0.75 per 1M input tokens, $3.50 per 1M output tokens, and $0.15 per 1M cached tokens. If you want to understand how those numbers translate to real workload costs — including how prompt caching affects your bill — the token math and cost-per-completion guide is worth reading before you scale up. The model runs with fp4 quantization (native INT4) and supports a 262,144-token context window, along with JSON output and function calling out of the box.
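To see how those rates translate into a bill, here is a quick back-of-the-envelope estimate in Python. Only the per-million-token rates come from the pricing above; the traffic figures (requests per day, prompt and completion sizes, cache hit rate) are illustrative assumptions, not DeepInfra numbers.

# Rough cost estimate for Kimi K2.6 on DeepInfra.
# The three rates below are the published prices; everything else
# (traffic volume, token counts, cache hit rate) is an assumed example workload.
INPUT_PER_M = 0.75    # USD per 1M input tokens
OUTPUT_PER_M = 3.50   # USD per 1M output tokens
CACHED_PER_M = 0.15   # USD per 1M cached input tokens

requests_per_day = 10_000   # assumption
input_tokens = 6_000        # assumption: prompt size per request
output_tokens = 1_200       # assumption: completion size per request
cache_hit_rate = 0.5        # assumption: share of prompt tokens served from cache

fresh_input = input_tokens * (1 - cache_hit_rate)
cached_input = input_tokens * cache_hit_rate

cost_per_request = (
    fresh_input / 1e6 * INPUT_PER_M
    + cached_input / 1e6 * CACHED_PER_M
    + output_tokens / 1e6 * OUTPUT_PER_M
)
print(f"per request: ${cost_per_request:.4f}")                      # ~$0.0069
print(f"per day:     ${cost_per_request * requests_per_day:.2f}")   # ~$69.00

With a 50% cache hit rate the input side of the bill drops noticeably, which is why the prompt-caching section of the cost guide linked above is worth the read before you scale up.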

DeepInfra gives you API access with zero infrastructure setup — just an API key and the standard OpenAI-compatible interface. You pay for what you use, nothing more. DeepInfra operates under a zero-retention policy and is SOC 2 and ISO 27001 certified.

To make your first call, grab your API key from the DeepInfra dashboard and drop it into any of the examples below (cURL, Python, or Node.js):

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "moonshotai/Kimi-K2.6",
      "messages": [
        {
          "role": "user",
          "content": "Write a Python script that monitors a directory for file changes and logs them."
        }
      ]
    }'

from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)


response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[
        {
            "role": "user",
            "content": "Write a Python script that monitors a directory for file changes and logs them."
        }
    ],
)
print(response.choices[0].message.content)

import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: "$DEEPINFRA_TOKEN",
  baseURL: "https://api.deepinfra.com/v1/openai",
});


const response = await openai.chat.completions.create({
  model: "moonshotai/Kimi-K2.6",
  messages: [
    {
      role: "user",
      content: "Write a Python script that monitors a directory for file changes and logs them.",
    },
  ],
});
console.log(response.choices[0].message.content);

The only things that differ from a standard OpenAI call are the base URL (https://api.deepinfra.com/v1/openai), your DeepInfra token, and the model name. If you already use the OpenAI Python or Node.js SDK, nothing else changes.
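Function calling works through the same interface. The sketch below passes a tools array the way the standard OpenAI SDK expects; the get_weather tool and its schema are hypothetical placeholders for illustration, not part of DeepInfra's or Moonshot's API.

from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Hypothetical tool definition: the name and schema are illustrative only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)

# If the model chooses to call the tool, the arguments arrive as structured JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)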

If you’re working on inference optimization and want to understand the tradeoffs between precision formats and quantization before deploying at scale, the practical guide to quantization and LLM inference costs covers how choices like fp4 and INT4 affect latency and throughput in practice. For a direct comparison between K2.6 and Kimi K2 Instruct, both are available on DeepInfra — or check the September 0905 update if you have existing workloads pinned to that version.
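One more practical note before wrapping up: the JSON output support mentioned earlier is typically exposed on OpenAI-compatible endpoints through the response_format parameter, and the sketch below assumes DeepInfra follows that convention for this model. Describing the desired structure in the prompt is just one way to steer the output; adapt it to your own schema.

from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Assumes the endpoint honors the standard response_format parameter for JSON output.
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[
        {
            "role": "user",
            "content": "List three Python file-watching libraries as a JSON object "
                       "with a 'libraries' array of {name, description} entries.",
        }
    ],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)  # a JSON string you can json.loads()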

Conclusion

Kimi K2.6 is a credible option for teams building production agentic systems — not because of any single benchmark number, but because its architecture was designed around execution at scale from the ground up, and the numbers back that up across coding, research, and tool-use tasks. The jump from K2.5 to K2.6 in areas like Toolathlon and OSWorld reflects a model that’s maturing in the right direction. Developers working on autonomous coding pipelines, deep research agents, or multi-step orchestration workflows now have a capable, open-weight model they can deploy without vendor lock-in. The full spec, pricing details, and a browser-based API playground are all on the Kimi K2.6 model page on DeepInfra.
