We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Best MiMo-V2.5 API Providers Ranked
Published on 2026.07.01 by DeepInfra
Best MiMo-V2.5 API Providers Ranked

Executive Summary: Selecting the right API provider for Xiaomi’s MiMo-V2.5 is critical for optimizing production workflows. Based on the benchmark research, DeepInfra is the best provider for raw speed and low latency (130+ tokens/second), while Xiaomi’s first-party API is the most cost-effective, offering unmatched prompt caching discounts. This guide breaks down the model’s MoE architecture and ranks the top API providers by throughput, latency, and cost.

MiMo-V2.5 API Review Summary (2026-06-27)

  • Vendor / Release: Xiaomi · Released: April 22, 2026 · Open weights
  • License: MIT (commercial use permitted)
  • Model type: Reasoning model (extended thinking) · MoE: 310B total params / 15B active
  • Training Data: 48 trillion tokens
  • Modalities: Input: text, image, video, audio · Output: text
  • Context window: 1.0M tokens (~1,500 A4 pages)
  • Intelligence (Artificial Analysis Intelligence Index): 40 (estimated) (above comparable open-weight median: 25)
  • Speed: 87.2 output tokens/sec (above comparable median: 68.7 t/s)
  • Latency (TTFT): 2.76s (somewhat higher than comparable median: 2.35s)
  • Price (Xiaomi API): $0.14 / 1M input tokens, $0.28 / 1M output tokens (competitive vs open-weight medians: $0.55 / $1.90)
  • Cache hit price: $0.003 / 1M tokens · Blended (7:2:1 cache/input/output): $0.06 / 1M tokens
  • Weights: huggingface.co/XiaomiMiMo/MiMo-V2.5

MiMo-V2.5 – Best APIs

ProviderWhy It’s a Strong Option for MiMo-V2.5Best ForKey Checks Before Committing
DeepInfra (deepinfra.com)Strong choice for MiMo-V2.5: offers a developer-friendly hosted API for open-weights models with fast onboarding, OpenAI-compatible endpoints, and straightforward production deployment vs self-hosting.Teams that want hosted open-weights without running their own MoE inference stack; rapid prototyping → production.Confirm exact input/output + caching pricing, context limits, rate limits, regions, and whether image/audio input is supported for this model.
Xiaomi (first-party API)Baseline reference for the page’s measured metrics: 87.2 t/s, TTFT 2.76s, pricing $0.14 in / $0.28 out per 1M tokens; cache hit $0.003.When you want canonical pricing/perf aligned to the benchmark source.Verify uptime/SLA, global latency from your region, and any throughput constraints at peak.
Self-host (using open weights)Maximum control: deployment topology, data residency, custom batching/quantization, and integration with your infra; MIT license supports commercial use.Regulated workloads, strict data control, or when you can optimize cost at high volume.Hardware requirements for 310B MoE, engineering overhead, ops burden, and achieving similar speed/TTFT to hosted APIs.

What is MiMo-V2.5?

Released on April 22, 2026 by Xiaomi, MiMo-V2.5 represents a major step forward in agentic capability and multimodal understanding. The model is part of Xiaomi’s MiMo family, which has rapidly gained traction in the open-weights space for its combination of frontier-level intelligence and aggressive pricing.

MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture. Built upon the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders (both pretrained in-house and connected through lightweight projectors), it delivers robust performance across multimodal perception, long-context reasoning, and agentic workflows.

What makes MiMo-V2.5 particularly notable is its position on the price-performance frontier. Xiaomi has positioned this model family as among the first with strong cost-parity for high-volume agentic coding workloads, making previously cost-prohibitive AI applications more practical for production deployment.

MiMo-V2.5: Overall Model Analysis & Technical Specifications

Before evaluating the infrastructure providers, it is useful to understand the technical architecture and baseline capabilities of MiMo-V2.5. The model is optimized for complex reasoning, long-horizon agentic tasks, and multimodal processing.

  • Architecture & Parameter Count: MiMo-V2.5 utilizes a sparse Mixture of Experts (MoE) architecture. It features 310 billion total parameters, with only 15 billion active parameters executed during each inference forward pass. This sparsity aims to deliver frontier-level intelligence with efficient compute costs. The language backbone inherits from MiMo-V2-Flash’s hybrid sliding-window attention architecture.
  • Training Data: The model was trained on approximately 48 trillion tokens, encompassing diverse text data for pre-training, multimodal projector warmup, high-quality multimodal pre-training data, and supervised fine-tuning with diverse agentic data.
  • Context Window: The model supports a 1.0M token context window (roughly 1,500 A4 pages of text). This makes it well-suited for Retrieval-Augmented Generation (RAG) and long-context document analysis.
  • Intelligence Index: MiMo-V2.5 scores a 40 on the Artificial Analysis Intelligence Index, placing it well above the median score of 25 for comparable open-weights models.
  • Modality & Reasoning: As an omnimodal reasoning model, it processes text, image, video, and audio inputs (outputting text). It natively utilizes extended chain-of-thought “thinking” time to solve complex physics, coding, and mathematical problems.
  • Benchmark Performance: On agentic benchmarks, MiMo-V2.5 delivers strong performance. On Claw-Eval, a benchmark for daily agentic tasks, it achieves a 62.3 on the general subset, placing it near the Pareto frontier of performance and efficiency. Across image, video, and multimodal agentic tasks, MiMo-V2.5 remains competitive with frontier closed-source models.
  • Licensing: It is released under the MIT License, allowing for unrestricted commercial use, modification, and self-hosting.

Top MiMo-V2.5 API Providers Analyzed

Based on benchmarking of throughput (tokens per second – t/s), latency (Time to First Token – TTFT), and pricing models, here is a technical comparison of the top MiMo-V2.5 API providers.

ProviderBest ForOutput Speed (t/s)Latency (TTFT)Input Price (per 1M)Output Price (per 1M)
DeepInfraSpeed & Low Latency~130+Market Lowest$0.40$2.00
XiaomiCost-Efficiency & Caching87.22.76s$0.14 ($0.003 Cached)$0.28
NovitaFallback / Routing86.0AverageBlended TiersBlended Tiers
ParasailAsynchronous Budgets~65-69HigherUltra-LowUltra-Low

1. DeepInfra (Best for Speed and Low Latency)

The Verdict: DeepInfra is the top recommended API provider for MiMo-V2.5 due to its strong token throughput and latency optimization.

For enterprise and agentic workflows where response speed is the primary bottleneck, DeepInfra is a strong choice. Benchmarks indicate that DeepInfra achieves peak output speeds exceeding 130 tokens per second (t/s), significantly outpacing the first-party baseline. DeepInfra also offers some of the lowest reported latency in the market for this model, minimizing the Time to First Token (TTFT).

Because MiMo-V2.5 requires “thinking time” before generating an answer, minimizing baseline network and compute latency is important. DeepInfra’s optimized MoE inference engine makes it a strong choice for production-grade, real-time applications.

DeepInfra is a serverless inference platform that hosts open-weight AI models as API endpoints. The API is OpenAI-compatible, so switching from OpenAI usually means changing a base URL and API key. DeepInfra runs its own infrastructure including NVIDIA Blackwell B200 systems, and the platform supports streaming responses, function calling, JSON mode, and structured output. The company is SOC 2 and ISO 27001 certified.

2. Xiaomi First-Party API (Best for Cost-Efficiency and Prompt Caching)

The Verdict: The best choice for developers prioritizing raw cost-efficiency, large context windows, and official first-party support.

As the creator of the model, Xiaomi offers a highly competitive first-party API. Technically, it delivers a solid 87.2 output tokens per second and a TTFT of 2.76 seconds. Where Xiaomi truly shines is its aggressive pricing structure. It charges just $0.14 per 1M input tokens and $0.28 per 1M output tokens. Xiaomi also offers a Cache Hit Price of $0.003 per 1M tokens.

For RAG applications relying heavily on prompt caching (e.g., feeding the same massive codebase into the 1M context window repeatedly), Xiaomi’s API provides strong economic value. The platform also offers OpenAI- and Anthropic-compatible APIs with comprehensive documentation and low-latency inference.

Note: Xiaomi is deprecating the V2 series by June 30, 2026. Developers using MiMo-V2-Flash or related models should migrate to the V2.5 series before the legacy names expire.

3. Novita (Best Fallback and Routing Option)

The Verdict: A solid middle-ground provider offering competitive speeds and reliable uptime for multi-API architectures.

Novita serves as a strong alternative routing option. Clocking in at roughly 86.0 tokens per second, its throughput is nearly identical to Xiaomi’s first-party offering. While it does not reach the speeds of DeepInfra, Novita maintains consistent latency metrics and offers competitive blended pricing tiers. It is a reasonable fallback provider in multi-API routing setups to help maintain high availability for agentic workflows.

4. Parasail (Best for Asynchronous, Extreme-Budget Workloads)

The Verdict: A niche provider suited for offline bulk processing where cost is more important than immediate speed.

While Parasail trails behind DeepInfra and Xiaomi in raw output speed (averaging in the high 60s for t/s) and has slightly higher latency, it competes on price. For offline processing tasks, bulk data extraction, or asynchronous RAG pipelines where end-to-end response time is not mission-critical, Parasail’s low blended token pricing makes it a viable secondary option.

Frequently Asked Questions (FAQ)

What is the parameter size of MiMo-V2.5?

MiMo-V2.5 is a sparse Mixture of Experts (MoE) model featuring 310 billion total parameters. It is optimized for inference compute, utilizing only 15 billion active parameters per token during generation.

What is the maximum context window for MiMo-V2.5?

The model natively supports a 1,000,000 (1M) token context window. This is large enough to process approximately 1,500 standard A4 pages of text, or entire codebases, in a single prompt.

Is MiMo-V2.5 a multimodal model?

Yes. MiMo-V2.5 is a native omnimodal model supporting text, image, video, and audio inputs. It can analyze, describe, and reason over visual and audio data alongside text, though its final output modality is strictly text.

How much training data was MiMo-V2.5 trained on?

MiMo-V2.5 was trained on approximately 48 trillion tokens, including diverse text data, multimodal pre-training data, and supervised fine-tuning with agentic data.

Why is DeepInfra recommended over the first-party Xiaomi API?

While Xiaomi offers superior pricing and prompt caching rates, DeepInfra provides stronger technical performance. DeepInfra yields significantly higher output tokens per second (130+ t/s) and lower Time to First Token (TTFT), making it the optimal choice for latency-sensitive applications. For cost-sensitive batch workloads, Xiaomi remains the better choice.

Is MiMo-V2.5 open source?

Yes, MiMo-V2.5 is an open-weights model released by Xiaomi under the permissive MIT license, meaning it is fully available for unrestricted commercial use, modification, and self-hosting. Weights, tokenizer, and the full model card are available on Hugging Face.

What hardware do I need to self-host MiMo-V2.5?

Self-hosting MiMo-V2.5 requires significant GPU resources due to its 310B parameter MoE architecture. Consumer GPUs do not have enough VRAM — enterprise-grade hardware such as a well-equipped workstation or cloud GPU instances is needed. Refer to the SGLang MiMo-V2.5 Cookbook for the latest deployment guide.

How does MiMo-V2.5 compare to MiMo-V2.5-Pro?

MiMo-V2.5-Pro is Xiaomi’s larger flagship variant with 1.02 trillion total parameters and 42 billion active parameters. It offers higher intelligence scores and is designed for more demanding long-horizon tasks, but at a higher price point. MiMo-V2.5 offers a strong balance of capability and cost for most production use cases.

Does DeepInfra support the full 1M context window for MiMo-V2.5?

Context window support may vary by provider. DeepInfra’s deployment may have different context limits than Xiaomi’s native API. Always confirm the exact context limits with your chosen provider before committing to production workloads.

Related articles
How to Use OpenClaw with DeepInfra: Setup & Workflow GuideHow to Use OpenClaw with DeepInfra: Setup & Workflow Guide<p>When you first learn how to use OpenClaw, the onboarding flow asks for an API key and points you toward Anthropic or OpenAI. Reasonable starting point. For production agents running dozens of tasks a day, it&#8217;s an expensive one. OpenClaw works with any OpenAI-compatible API, so you can swap the default model for an open-weight [&hellip;]</p>
How Mixture of Experts Models Changed LLM EconomicsHow Mixture of Experts Models Changed LLM Economics<p>Every open-weight model that has closed the gap with GPT-5.5 and Claude Opus 4.7 this year has one thing in common. DeepSeek V4-Pro: 1.6 trillion parameters, 49 billion active per token. Kimi K2.6: 1 trillion parameters, 32 billion active. GLM-5.1: 744 billion parameters, 40 billion active. MiniMax M2.7: large total parameter count, 10 billion active [&hellip;]</p>
Best API Providers for NVIDIA Nemotron 3 Super 120BBest API Providers for NVIDIA Nemotron 3 Super 120B<p>Nemotron 3 Super 120B is available across a growing number of hosted APIs and deployment platforms. At 120B total parameters with 12B active per inference pass, the right provider matters: latency, throughput, and cost vary significantly depending on where you run it. This guide covers the top options by use case — from fully managed [&hellip;]</p>