
Kimi K2.5 API Benchmarks: Latency, Throughput & Cost
Published on 2026.04.03 by DeepInfra

About Kimi K2.5

Kimi K2.5 is Moonshot AI’s flagship open-source reasoning model, released in January 2026. It is a native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens. The model features a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters and 32 billion activated parameters.

Kimi K2.5 operates in both “Thinking” and “Instant” modes, allowing developers to toggle between deep chain-of-thought reasoning and faster, direct responses. The model supports a 256K token context window and excels in visual knowledge, cross-modal reasoning, and agentic tool use. One of its standout capabilities is “Agent Swarm” technology, which enables the model to decompose complex tasks into parallel sub-tasks executed by dynamically instantiated, domain-specific agents.

On benchmarks, Kimi K2.5 has set state-of-the-art records on Humanity’s Last Exam (HLE), BrowseComp, and other agentic benchmarks, achieving 50.2% on HLE with tools, 96.1% on AIME 2025, and 76.8% on SWE-Bench Verified.

Kimi K2.5 is now available across multiple inference providers — but not all providers are created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

Kimi K2.5 (Reasoning) API Review Summary

  • DeepInfra is the lowest-cost option at $0.90 blended / 1M tokens (3:1 input:output blend), beating other low-cost options at $1.00 (Nebius Fast, Parasail, Nebius).
  • DeepInfra has the lowest input token price: $0.45 / 1M input tokens (next best: $0.50 at Nebius Fast and Together.ai).
  • DeepInfra has the lowest output token price: $2.25 / 1M output tokens (next best: $2.50 at Clarifai and Nebius Fast).
  • Provider blended prices vary ~1.3x across providers — DeepInfra’s pricing advantage is meaningful and consistent vs. the market range.
  • DeepInfra Turbo delivers much higher output speed (334 t/s) at a still-competitive blended price ($1.20 / 1M tokens) for those needing higher throughput.
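The blended figures above follow the standard 3:1 input:output weighting. As a quick sanity check, the calculation can be sketched as follows (prices are the per-million-token rates from this benchmark):

```python
def blended_price(input_per_m: float, output_per_m: float, ratio: int = 3) -> float:
    """Blended $/1M tokens assuming a ratio:1 input:output token mix."""
    return (ratio * input_per_m + output_per_m) / (ratio + 1)

# DeepInfra: $0.45 in / $2.25 out -> $0.90 blended
print(round(blended_price(0.45, 2.25), 2))  # 0.9
# Nebius Fast: $0.50 in / $2.50 out -> $1.00 blended
print(round(blended_price(0.50, 2.50), 2))  # 1.0
```

Note that the 3:1 weighting assumes a typical chat workload; reasoning-heavy traffic skews toward output tokens, which makes the output rate matter more.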

Kimi K2.5 (Reasoning) — Best APIs

| Provider | Best For | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Speed (t/s) | Latency (TTFT) | Why Notable |
|---|---|---|---|---|---|---|---|
| DeepInfra | Lowest cost / scale-out workloads | $0.90 | $0.45 | $2.25 | 66 | 1.06s | Best unit economics — lowest blended, input, and output pricing. Ideal for batch, large-context, and cost-sensitive production. |
| DeepInfra Turbo | Cost-aware speed upgrade | $1.20 | N/A | N/A | 334 | 0.69s | Pay a bit more, get far more speed — while staying in the mainstream price band. |
| Nebius Fast | Low cost + high speed | $1.00 | $0.50 | $2.50 | 338 | 1.86s | Fast throughput near top tier while staying close to the low-price floor. |
| Together.ai | Maximum throughput | $1.07 | $0.50 | N/A | 431.1 | 1.37s | Fastest output speed measured; good for throughput-first systems at a still-competitive price. |
| Baseten | Lowest latency | $1.20 | N/A | N/A | 334 | 0.40s | Best TTFT for interactive UX, though at higher blended price than DeepInfra. |

Quick Verdict: Which Kimi K2.5 Provider is Best?

Based on benchmarks across 17 tracked providers, DeepInfra is the recommended API for production-scale Kimi K2.5 deployment. It offers the market’s lowest price ($0.90/1M) for background tasks and a high-performance Turbo tier ($1.20/1M) that rivals the fastest competitors in throughput and latency. For maximum throughput, Together.ai leads at 431.1 t/s. For the lowest latency, Baseten delivers a best-in-class 0.40s TTFT.

Overall Winner: DeepInfra

Best for: Cost efficiency and flexible performance tiers.

DeepInfra secures the top spot by offering a two-tier service model that caters to both cost-sensitive batch processing and high-performance interactive applications. It is currently the most affordable provider on the market.

  • Price (Standard): $0.90 per 1M tokens (Blended)
  • Price (Turbo): $1.20 per 1M tokens (Blended)
  • Context Window: 262k tokens
  • API Features: JSON Mode, Function Calling (Standard tier)

At $0.90 per 1M tokens, DeepInfra is the cheapest option available, undercutting the closest competitors (Nebius Fast and Parasail) by 10%. The Turbo tier jumps to 334 tokens/sec with a latency of 0.69s, giving developers the flexibility to use the Standard tier for background reasoning tasks and the Turbo tier for user-facing applications — all within the same ecosystem.

Important: While DeepInfra Standard supports Function Calling, DeepInfra Turbo does not currently list this feature. Developers requiring tool use should select the Standard endpoint or verify recent updates.

Best for Throughput: Together.ai

Best for: High-volume text generation and long-context reasoning.

If raw generation speed is the primary KPI, Together.ai is the market leader. Kimi K2.5 is a reasoning model, meaning it generates “thinking” tokens before the final answer — high output speed is critical to reducing total wait time.

  • Output Speed: 431.1 tokens/sec
  • Latency (TTFT): 1.37s
  • Price: $1.07 per 1M tokens

Together.ai clocks in at 431.1 t/s — approximately 14.3x faster than the slowest provider (SiliconFlow). It outperforms the second-fastest provider, Eigen AI, by a margin of ~7 t/s. Despite this premium speed, its pricing ($1.07) remains highly competitive, sitting only slightly above the $1.00 budget tier.
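To see why output speed dominates total wait time for a reasoning model, a rough end-to-end estimate can be sketched as TTFT plus generation time. The 2,000-token response length below is an illustrative assumption; the TTFT and throughput figures are from this benchmark:

```python
def total_time(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Rough end-to-end response time: time to first token plus generation time."""
    return ttft_s + output_tokens / tokens_per_s

# Assumed 2,000 output tokens (reasoning + final answer):
print(f"Together.ai:        {total_time(1.37, 2000, 431.1):.1f}s")  # ~6.0s
print(f"DeepInfra Standard: {total_time(1.06, 2000, 66):.1f}s")     # ~31.4s
```

The gap widens with longer chains of thought, which is why throughput-first providers matter for reasoning-heavy workloads even when their TTFT is middling.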

Best for Low Latency: Baseten

Best for: Real-time chatbots and interactive agents.

For applications where the perceived speed (Time to First Token) is more important than total generation time, Baseten offers the most responsive infrastructure.

  • Latency (TTFT): 0.40 seconds
  • Output Speed: 334 tokens/sec
  • Price: $1.20 per 1M tokens

Baseten achieves a remarkable 0.40s TTFT — significantly faster than the average provider, beating the runner-up FriendliAI (0.52s) by 120ms. It maintains a high output speed of 334 t/s (identical to DeepInfra Turbo), ensuring that once the first token appears, the rest of the response follows rapidly.

Best Value Alternative: Nebius Fast

Best for: A balance of speed and pricing.

Nebius Fast offers a compelling sweet spot between the extreme speed of Together.ai and the extreme economy of DeepInfra.

  • Price: $1.00 per 1M tokens
  • Output Speed: 338.3 tokens/sec
  • Latency (TTFT): 1.86s

Nebius Fast matches DeepInfra Turbo’s throughput (~338 t/s) but at a lower price point ($1.00 vs $1.20). However, it suffers in latency metrics with a TTFT of 1.86s — more than 4.5x slower than Baseten. It is an excellent choice for non-interactive workloads where throughput per dollar is the primary metric.

Comparative Technical Metrics

Speed vs. Latency (Top 5 Providers)

| Provider | Output Speed (t/s) | Latency (TTFT) |
|---|---|---|
| Together.ai | 431.1 | 1.37s |
| Eigen AI | 423.7 | 1.14s |
| Clarifai | 370.7 | 0.74s |
| Fireworks | 353.7 | 0.62s |
| DeepInfra Turbo | 334.0 | 0.69s |

Price Efficiency (Lowest to Highest)

| Provider | Blended Price ($/1M) | Input Price | Output Price |
|---|---|---|---|
| DeepInfra | $0.90 | $0.45 | $2.25 |
| Nebius Fast | $1.00 | $0.50 | $2.50 |
| Parasail | $1.00 | N/A | N/A |
| Clarifai | $1.07 | N/A | $2.50 |
| Together.ai | $1.07 | $0.50 | N/A |

Feature Support: JSON Mode & Tool Calling

Technical integration is just as important as raw speed.

  • JSON Mode: Supported by 15 of 17 providers, including DeepInfra, Together.ai, and Baseten. Critical for ensuring the model outputs valid JSON objects for programmatic parsing.
  • Function Calling (Tool Use): Supported by 13 of 17 providers. DeepInfra Standard supports it; DeepInfra Turbo does not currently. Developers needing tool use must select the Standard endpoint or verify recent updates.
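On providers that expose JSON Mode through the OpenAI-compatible API, the toggle is typically the `response_format` parameter. The payload below is a hypothetical sketch using the OpenAI-style parameter names, not any single provider’s documentation — verify your provider honors the field before depending on it:

```python
# Hypothetical request parameters for JSON Mode on an OpenAI-compatible endpoint.
request_params = {
    "model": "moonshotai/kimi-k2.5-reasoning",
    "messages": [
        {
            "role": "user",
            "content": "List the three fastest providers as JSON with keys 'name' and 'tps'.",
        }
    ],
    # OpenAI-style JSON Mode toggle; support varies by provider.
    "response_format": {"type": "json_object"},
}
print(request_params["response_format"])
```

These parameters would be passed to `client.chat.completions.create(**request_params)` with a client configured as in the integration guide below.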

Technical Integration Guide

Most providers hosting Kimi K2.5 utilize OpenAI-compatible endpoints. Here is how to configure your client for DeepInfra:

import os
from openai import OpenAI

# Configuration for DeepInfra (Best Value)
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ.get("DEEPINFRA_API_KEY"),
)

response = client.chat.completions.create(
    model="moonshotai/kimi-k2.5-reasoning",
    messages=[{"role": "user", "content": "Explain quantum entanglement."}],
    stream=True,
)

# Print tokens as they arrive from the stream
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Note: When using Kimi K2.5, “Reasoning Tokens” are billed as output tokens. Ensure your max_tokens limit accounts for the internal chain-of-thought process.

Frequently Asked Questions

Does Kimi K2.5 support Function Calling on all providers?

No. While the model supports it natively, DeepInfra Turbo does not currently support function calling, whereas DeepInfra Standard, Together.ai, and Baseten do.

How does Kimi K2.5 compare to DeepSeek R1?

Kimi K2.5 generally offers higher throughput on equivalent hardware, though DeepSeek R1 remains cheaper on legacy providers. Kimi’s advantage lies in its 262k context window and native multimodal capabilities.

What is the difference between DeepInfra Standard and Turbo?

Standard operates at ~66 t/s and costs $0.90/1M. Turbo operates at ~334 t/s and costs $1.20/1M. Use Standard for batch jobs and Turbo for live applications.

What are “Reasoning Tokens” and how are they billed?

Reasoning models like Kimi K2.5 generate internal “thinking” tokens before producing the final answer. These reasoning tokens are billed as output tokens. The prices listed in this benchmark include reasoning output tokens.
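Because reasoning tokens bill at the output rate, a per-request cost estimate must add them to the final-answer tokens. The token counts below are illustrative assumptions; the rates are DeepInfra’s prices from this benchmark:

```python
def request_cost(input_toks: int, reasoning_toks: int, answer_toks: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Per-request cost in dollars; reasoning tokens bill at the output rate."""
    billable_output = reasoning_toks + answer_toks
    return (input_toks * input_per_m + billable_output * output_per_m) / 1_000_000

# Assumed request: 10k input, 3k reasoning, 1k answer at DeepInfra rates
print(round(request_cost(10_000, 3_000, 1_000, 0.45, 2.25), 4))  # 0.0135
```

In this example, three quarters of the output bill comes from reasoning tokens — which is also why the max_tokens limit needs headroom for the chain of thought.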

What is the context window for Kimi K2.5?

Kimi K2.5 supports a 256K–262K token context window depending on the provider configuration.

Conclusion

For the majority of developers, DeepInfra is the superior choice for Kimi K2.5. It offers the market’s lowest price ($0.90/1M) for background tasks and a high-performance Turbo tier ($1.20/1M) that rivals the fastest competitors in throughput and latency.

  • Choose DeepInfra for the best overall value and flexibility.
  • Choose Together.ai if your application requires generating massive amounts of text rapidly.
  • Choose Baseten if your application is a user-facing chatbot where every millisecond of initial latency counts.