
MiniMax-M2.5 API Benchmarks: Latency, Throughput & Cost
Published on April 3, 2026 by DeepInfra

About MiniMax-M2.5

MiniMax-M2.5 is a state-of-the-art open-weights large language model released in February 2026. Built on a 230B-parameter Mixture of Experts (MoE) architecture with approximately 10 billion active parameters per forward pass, it features Lightning Attention and supports a context window of up to 205,000 tokens. The model uses extended chain-of-thought reasoning to work through complex problems.

M2.5 was trained extensively with reinforcement learning across more than 200,000 real-world environments, covering over 10 programming languages including Go, C, C++, TypeScript, Rust, Kotlin, Python, Java, JavaScript, PHP, Lua, Dart, and Ruby. A notable characteristic is its spec-writing tendency — before writing any code, M2.5 actively decomposes and plans features, structure, and UI design from the perspective of an experienced software architect.

The model achieves industry-leading benchmark scores: 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp. It completes SWE-Bench Verified evaluations 37% faster than its predecessor M2.1 while consuming fewer tokens. Beyond coding, M2.5 excels at office productivity tasks including generating formatted Word documents, PowerPoint presentations, and Excel spreadsheets with working formulas. The model weights are fully open-sourced on HuggingFace under a Modified MIT License.

MiniMax-M2.5 is now available across multiple inference providers, but they’re not created equal. This analysis breaks down which one delivers the best performance, lowest cost, and fastest response times for your use case.

MiniMax-M2.5 API Review Summary

  • Open weights model (released February 2026) with multiple tracked API providers.
  • Benchmarks reflect sustained performance: median (P50) over the past 72 hours, using a default workload of 10,000 input tokens.
  • DeepInfra (FP8) is the standout balanced provider: #2 lowest blended price ($0.44 / 1M tokens) and #3 lowest latency (0.56s TTFT).
  • DeepInfra (FP8) is also strong on token pricing: #2 lowest input price ($0.27 / 1M tokens), with a blended rate undercut only by SiliconFlow.
  • Category leaders: SambaNova fastest output speed (394.6 t/s), Together.ai (FP4) lowest latency (0.42s), SiliconFlow (FP8) lowest blended price ($0.40).
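The blended figures for DeepInfra and SiliconFlow are consistent with the common 3:1 input:output token weighting. That ratio is our inference from the published prices, not a stated methodology, so treat this as a sketch:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  input_ratio: float = 0.75) -> float:
    """Weighted per-million-token price; the 3:1 input:output ratio is assumed."""
    return input_per_m * input_ratio + output_per_m * (1 - input_ratio)

# DeepInfra (FP8): $0.27 in / $0.95 out
print(round(blended_price(0.27, 0.95), 2))  # 0.44
# SiliconFlow (FP8): $0.20 in / $1.00 out
print(round(blended_price(0.20, 1.00), 2))  # 0.40
```

Both reproduce the published blended rates exactly under this weighting; other providers' blends may use a different mix.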

MiniMax-M2.5 — Best APIs

| Provider | Why It's Best | Blended ($/1M) | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | E2E (s/500 tok) |
|---|---|---|---|---|---|---|---|
| DeepInfra (FP8) | Best value + low-latency balance; strong for cost-sensitive apps needing a snappy first-token response | $0.44 | $0.27 | $0.95 | 0.56s | 66 | 38.64s |
| SiliconFlow (FP8) | Lowest blended price overall (budget-first) | $0.40 | $0.20 | $1.00 | 1.90s | 85 | 31.47s |
| Together.ai (FP4) | Lowest latency (interactive-first) | $0.53 | $0.50 | $0.55 | 0.42s | 95 | 26.80s |
| Fireworks | Very high throughput (speed-first) | $0.53 | $0.50 | $0.55 | 0.76s | 193 | 13.71s |
| Clarifai | Strong low-latency + good speed combination | $0.53 | $0.50 | $0.55 | 0.54s | 143 | 18.07s |
| SambaNova | Fastest output speed overall (throughput-max) | $0.53 | $0.50 | $0.55 | 1.60s | 395 | 7.93s |

Quick Verdict: Which MiniMax-M2.5 Provider is Best?

Based on benchmarks across tracked providers, DeepInfra is the recommended API for production-scale MiniMax-M2.5 deployment. It offers the best balance of low latency, competitive pricing, and full feature support. For use cases requiring maximum throughput, SambaNova leads the field. For absolute lowest latency, Together.ai is the fastest provider tested.

Overall Recommendation: DeepInfra (FP8)

DeepInfra emerges as the superior choice for production workloads, offering the most robust balance of low latency, competitive pricing, and feature completeness. While other providers may win in a single metric, DeepInfra consistently scores in the top tier across all critical categories without significant trade-offs.

  • Latency (TTFT): 0.56s (3rd lowest overall)
  • Output Speed: 66 t/s
  • Pricing: $0.27 Input / $0.95 Output ($0.44 Blended)
  • Context Window: 197k
  • Feature Support: Full support for both JSON Mode and Function Calling
  • Best For: RAG applications, Agentic workflows, General Chat

DeepInfra utilizes FP8 quantization to deliver a TTFT of 0.56 seconds, a gap from the fastest provider (Together.ai at 0.42s) that is imperceptible to humans. Crucially, it achieves this while charging significantly less ($0.44 vs $0.53 per 1M tokens) than the majority of the market.

Unlike the fastest throughput providers (SambaNova) which suffer from high latency (1.60s), DeepInfra maintains a snappy interactive feel. For developers building RAG applications or agents requiring tool use, DeepInfra’s combination of low latency, sub-$0.50 pricing, and full tool-calling support makes it the definitive all-rounder.

Integration Example (Python)

from openai import OpenAI

# DeepInfra exposes an OpenAI-compatible endpoint, so the standard SDK works.
client = OpenAI(
    api_key="YOUR_DEEPINFRA_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="minimax/minimax-m2.5",
    messages=[{"role": "user", "content": "Explain quantum entanglement."}],
)
print(response.choices[0].message.content)
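Because TTFT is the headline metric here, it is worth measuring on your own network path. A sketch using streaming against the same endpoint and model ID as above; stream chunks only approximate tokens, and your numbers will differ from the benchmark's:

```python
import os
import time

def measure_stream(chunks, start):
    """Return (ttft_seconds, chunk_count) from (arrival_time, text) pairs."""
    ttft, count = None, 0
    for t, text in chunks:
        if text:
            if ttft is None:
                ttft = t - start  # time to first content chunk
            count += 1
    return ttft, count

# The live call is gated on a key so the sketch runs harmlessly offline.
if os.environ.get("DEEPINFRA_API_KEY"):
    from openai import OpenAI
    client = OpenAI(api_key=os.environ["DEEPINFRA_API_KEY"],
                    base_url="https://api.deepinfra.com/v1/openai")
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="minimax/minimax-m2.5",
        messages=[{"role": "user", "content": "Explain quantum entanglement."}],
        stream=True,
    )
    events = ((time.perf_counter(), c.choices[0].delta.content or "")
              for c in stream if c.choices)
    ttft, n = measure_stream(events, start)
    print(f"TTFT: {ttft:.2f}s across {n} content chunks")
```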

The Throughput Specialist: SambaNova

If your use case involves batch processing or generating long-form content where the start time matters less than the completion time, SambaNova is the undisputed leader in throughput.

  • Output Speed: 394.6 tokens/second (roughly 8x faster than the slowest tracked provider)
  • Latency (TTFT): 1.60s
  • Price: $0.53 per 1M tokens
  • Context Window: 164k
  • Best For: Batch processing, Summarization, Offline jobs

SambaNova’s architecture delivers 394.6 t/s — roughly 2x the second-fastest provider (Fireworks at 193.1 t/s) and about 8x the slowest tracked provider (MiniMax Direct at 49 t/s). SambaNova uses a specialized Reconfigurable Dataflow Unit (RDU) architecture rather than standard GPUs, allowing for massive throughput at the cost of higher initial latency.

This speed comes with trade-offs. SambaNova has the lowest context window in the benchmark set (164k vs the standard ~200k) and a relatively high TTFT of 1.60s. This makes it ideal for background generation tasks but less suitable for real-time conversational interfaces.
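That trade-off can be put into first-order numbers: estimated wall-clock is TTFT plus output tokens divided by decode speed. This is an idealized lower bound — the measured E2E figures in the tables are higher because they include network and scheduling overhead — and the 5,000-token workload below is hypothetical:

```python
def est_completion_time(ttft_s: float, speed_tps: float, out_tokens: int) -> float:
    """Idealized wall-clock: first-token latency plus steady-state decode time.
    A lower bound; measured E2E also includes network and scheduling overhead."""
    return ttft_s + out_tokens / speed_tps

# Hypothetical long-form job: a single 5,000-token generation.
print(f"SambaNova: ~{est_completion_time(1.60, 394.6, 5000):.0f}s")  # ~14s
print(f"DeepInfra: ~{est_completion_time(0.56, 66.0, 5000):.0f}s")   # ~76s
```

The longer the generation, the more SambaNova's throughput dominates its TTFT penalty.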

The Latency Leader: Together.ai (FP4)

For real-time chat applications where perceived speed is defined by how quickly the first word appears, Together.ai takes the lead.

  • Latency (TTFT): 0.42s
  • Output Speed: 95 tokens/second
  • Price: $0.53 per 1M tokens
  • Quantization: FP4
  • Context Window: 197k
  • Best For: Real-time conversational AI, Customer support bots

Utilizing aggressive FP4 quantization, Together.ai achieves the lowest latency in the field at 0.42s. However, this comes at a premium price point ($0.53) compared to budget options. Its output speed of 95 t/s is also significantly slower than the top throughput providers — it is the fastest to start, but not the fastest to finish large generations.

The Cost Efficiency Leader: SiliconFlow (FP8)

For developers operating on tight margins or processing massive volumes of non-time-sensitive data, SiliconFlow offers the absolute lowest floor price.

  • Blended Price: $0.40 per 1M tokens ($0.20 Input / $1.00 Output)
  • Latency (TTFT): 1.90s
  • Output Speed: 85 tokens/second
  • Context Window: 197k
  • Best For: Academic research, Hobbyist projects, Non-urgent data extraction

SiliconFlow is the most affordable provider analyzed, undercutting the standard $0.53 market rate by roughly 24%. However, this saving comes with a latency trade-off: with a TTFT of 1.90s, it has one of the slowest response times in the benchmark, more than 3x slower than DeepInfra. It is an excellent choice for offline batch jobs but is not recommended for user-facing applications.
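At these rates, batch-job budgeting is simple arithmetic: tokens divided by one million, times the per-million price. A sketch using SiliconFlow's published rates (the document counts and sizes are hypothetical):

```python
def job_cost(n_docs: int, in_tok_per_doc: int, out_tok_per_doc: int,
             in_price: float, out_price: float) -> float:
    """Total USD for a batch: (tokens / 1M) * per-million-token price."""
    total_in = n_docs * in_tok_per_doc
    total_out = n_docs * out_tok_per_doc
    return total_in / 1e6 * in_price + total_out / 1e6 * out_price

# Summarizing 10,000 documents (8,000 tokens in, 500 tokens out each)
# at SiliconFlow's $0.20 input / $1.00 output:
print(f"${job_cost(10_000, 8_000, 500, 0.20, 1.00):,.2f}")  # $21.00
```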

The High-Performance Contender: Fireworks

Fireworks serves as a strong alternative for those who need high speed but cannot tolerate the high latency of SambaNova.

  • Output Speed: 193.1 tokens/second
  • Latency (TTFT): 0.76s
  • Price: $0.53 per 1M tokens
  • Context Window: 197k

Fireworks holds the #2 spot for output speed (193.1 t/s) while maintaining a respectable sub-second latency (0.76s). It bridges the gap between the speed leaders and the latency leaders. However, at $0.53 per 1M tokens, it is notably more expensive than DeepInfra without offering the same low-latency benefits.

Comparative Specs Table

| Provider | Optimization | Input ($/1M) | Output ($/1M) | Latency (TTFT) | Speed (t/s) | Context | JSON / Tools |
|---|---|---|---|---|---|---|---|
| DeepInfra | Balanced (Recommended) | $0.27 | $0.95 | 0.56s | 66 | 197k | Yes / Yes |
| SambaNova | Throughput | $0.50 | $0.55 | 1.60s | 394.6 | 164k | Yes / Yes |
| Together.ai | Latency | $0.50 | $0.55 | 0.42s | 95 | 197k | Yes / Yes |
| SiliconFlow | Cost | $0.20 | $1.00 | 1.90s | 85 | 197k | Yes / Yes |
| Fireworks | Speed / Hybrid | $0.50 | $0.55 | 0.76s | 193.1 | 197k | Yes / Yes |
| Clarifai | Hybrid | $0.50 | $0.55 | 0.54s | 142.6 | 205k | Yes / Yes |
| MiniMax Direct | Native | $0.50 | $0.55 | 3.23s | 49 | 205k | No / Yes |

Frequently Asked Questions

Which MiniMax-M2.5 provider is best for RAG?

DeepInfra is the best choice for RAG due to its low TTFT (0.56s) and full support for JSON Mode and Function Calling, which are essential for retrieving and formatting context.
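A minimal JSON Mode sketch against DeepInfra's OpenAI-compatible endpoint. The `response_format` parameter follows the OpenAI convention; the fence-stripping helper is a defensive extra for models that wrap JSON in Markdown fences, not something the provider documents as required:

```python
import json
import os
import re

def parse_json_reply(text: str) -> dict:
    """Parse a model reply as JSON, tolerating Markdown code fences."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    return json.loads(cleaned)

# The live call is gated on a key so the sketch runs harmlessly offline.
if os.environ.get("DEEPINFRA_API_KEY"):
    from openai import OpenAI
    client = OpenAI(api_key=os.environ["DEEPINFRA_API_KEY"],
                    base_url="https://api.deepinfra.com/v1/openai")
    response = client.chat.completions.create(
        model="minimax/minimax-m2.5",
        messages=[
            {"role": "system", "content": "Reply only with a JSON object."},
            {"role": "user",
             "content": "Extract title and year from: 'MiniMax-M2.5 shipped in 2026.'"},
        ],
        response_format={"type": "json_object"},  # OpenAI-style JSON Mode
    )
    print(parse_json_reply(response.choices[0].message.content))
```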

Does MiniMax-M2.5 support Function Calling?

Yes, the model supports function calling, but not all providers enable it. DeepInfra, Together.ai, and Fireworks support full tool-use, while the MiniMax Direct API currently has limited support.
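A hedged tool-use sketch using the OpenAI-style `tools` parameter against DeepInfra's endpoint. The `get_weather` tool, its schema, and the `dispatch` helper are illustrative, not part of the benchmark:

```python
import json
import os

def dispatch(tool_calls, registry):
    """Route model-requested tool calls to local Python functions by name."""
    results = []
    for call in tool_calls:
        fn = registry[call["name"]]
        results.append(fn(**json.loads(call["arguments"])))
    return results

def get_weather(city: str) -> str:
    # Stub tool; a real implementation would call a weather API.
    return f"Sunny in {city}"

TOOLS_SPEC = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# The live call is gated on a key so the sketch runs harmlessly offline.
if os.environ.get("DEEPINFRA_API_KEY"):
    from openai import OpenAI
    client = OpenAI(api_key=os.environ["DEEPINFRA_API_KEY"],
                    base_url="https://api.deepinfra.com/v1/openai")
    resp = client.chat.completions.create(
        model="minimax/minimax-m2.5",
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=TOOLS_SPEC,
    )
    calls = [{"name": c.function.name, "arguments": c.function.arguments}
             for c in (resp.choices[0].message.tool_calls or [])]
    print(dispatch(calls, {"get_weather": get_weather}))
```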

Why is SambaNova so much faster?

SambaNova uses a specialized Reconfigurable Dataflow Unit (RDU) architecture rather than standard GPUs, allowing for massive throughput (394.6 t/s) at the cost of higher initial latency.

Is MiniMax-M2.5 good for coding?

Yes. MiniMax-M2.5 achieves state-of-the-art performance in programming evaluations, scoring 80.2% on SWE-Bench Verified. The model was trained on over 10 languages across more than 200,000 real-world environments and excels at the entire development lifecycle — from system design to code review.

DeepInfra vs. Together.ai for MiniMax-M2.5?

DeepInfra offers better value at $0.44/1M tokens vs Together.ai’s $0.53/1M tokens. Together.ai has slightly lower latency (0.42s vs 0.56s), but for most applications, this 140ms difference is imperceptible to users. DeepInfra is the recommended choice unless sub-half-second latency is absolutely critical.

Conclusion

For the vast majority of MiniMax-M2.5 implementations, DeepInfra is the logical choice. It provides a premium low-latency experience (0.56s) usually reserved for more expensive providers, while maintaining a near-bottom-tier price point ($0.44). While SambaNova is technically superior for pure bulk text generation, DeepInfra’s versatility across RAG, agents, and chat interfaces makes it the standout provider in this benchmark.
