DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Executive Summary: Selecting the right API provider for Xiaomi’s MiMo-V2.5 is critical for optimizing production workflows. Based on the benchmark research, DeepInfra is the best provider for raw speed and low latency (130+ tokens/second), while Xiaomi’s first-party API is the most cost-effective, offering unmatched prompt caching discounts. This guide breaks down the model’s MoE architecture and ranks the top API providers by throughput, latency, and cost.
| Provider | Why It’s a Strong Option for MiMo-V2.5 | Best For | Key Checks Before Committing |
|---|---|---|---|
| DeepInfra (deepinfra.com) | Strong choice for MiMo-V2.5: offers a developer-friendly hosted API for open-weights models with fast onboarding, OpenAI-compatible endpoints, and straightforward production deployment vs self-hosting. | Teams that want hosted open-weights without running their own MoE inference stack; rapid prototyping → production. | Confirm exact input/output + caching pricing, context limits, rate limits, regions, and whether image/audio input is supported for this model. |
| Xiaomi (first-party API) | Baseline reference for the page’s measured metrics: 87.2 t/s, TTFT 2.76s, pricing $0.14 in / $0.28 out per 1M tokens; cache hit $0.003. | When you want canonical pricing/perf aligned to the benchmark source. | Verify uptime/SLA, global latency from your region, and any throughput constraints at peak. |
| Self-host (using open weights) | Maximum control: deployment topology, data residency, custom batching/quantization, and integration with your infra; MIT license supports commercial use. | Regulated workloads, strict data control, or when you can optimize cost at high volume. | Hardware requirements for 310B MoE, engineering overhead, ops burden, and achieving similar speed/TTFT to hosted APIs. |
Released on April 22, 2026 by Xiaomi, MiMo-V2.5 represents a major step forward in agentic capability and multimodal understanding. The model is part of Xiaomi’s MiMo family, which has rapidly gained traction in the open-weights space for its combination of frontier-level intelligence and aggressive pricing.
MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture. Built upon the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders (both pretrained in-house and connected through lightweight projectors), it delivers robust performance across multimodal perception, long-context reasoning, and agentic workflows.
What makes MiMo-V2.5 particularly notable is its position on the price-performance frontier. Xiaomi has positioned this model family as among the first with strong cost-parity for high-volume agentic coding workloads, making previously cost-prohibitive AI applications more practical for production deployment.
Before evaluating the infrastructure providers, it is useful to understand the technical architecture and baseline capabilities of MiMo-V2.5. The model is optimized for complex reasoning, long-horizon agentic tasks, and multimodal processing.
Based on benchmarking of throughput (tokens per second – t/s), latency (Time to First Token – TTFT), and pricing models, here is a technical comparison of the top MiMo-V2.5 API providers.
| Provider | Best For | Output Speed (t/s) | Latency (TTFT) | Input Price (per 1M) | Output Price (per 1M) |
|---|---|---|---|---|---|
| DeepInfra | Speed & Low Latency | ~130+ | Market Lowest | $0.40 | $2.00 |
| Xiaomi | Cost-Efficiency & Caching | 87.2 | 2.76s | $0.14 ($0.003 Cached) | $0.28 |
| Novita | Fallback / Routing | 86.0 | Average | Blended Tiers | Blended Tiers |
| Parasail | Asynchronous Budgets | ~65-69 | Higher | Ultra-Low | Ultra-Low |
The Verdict: DeepInfra is the top recommended API provider for MiMo-V2.5 due to its strong token throughput and latency optimization.
For enterprise and agentic workflows where response speed is the primary bottleneck, DeepInfra is a strong choice. Benchmarks indicate that DeepInfra achieves peak output speeds exceeding 130 tokens per second (t/s), significantly outpacing the first-party baseline. DeepInfra also offers some of the lowest reported latency in the market for this model, minimizing the Time to First Token (TTFT).
Because MiMo-V2.5 requires “thinking time” before generating an answer, minimizing baseline network and compute latency is important. DeepInfra’s optimized MoE inference engine makes it a strong choice for production-grade, real-time applications.
DeepInfra is a serverless inference platform that hosts open-weight AI models as API endpoints. The API is OpenAI-compatible, so switching from OpenAI usually means changing a base URL and API key. DeepInfra runs its own infrastructure including NVIDIA Blackwell B200 systems, and the platform supports streaming responses, function calling, JSON mode, and structured output. The company is SOC 2 and ISO 27001 certified.
The Verdict: The best choice for developers prioritizing raw cost-efficiency, large context windows, and official first-party support.
As the creator of the model, Xiaomi offers a highly competitive first-party API. Technically, it delivers a solid 87.2 output tokens per second and a TTFT of 2.76 seconds. Where Xiaomi truly shines is its aggressive pricing structure. It charges just $0.14 per 1M input tokens and $0.28 per 1M output tokens. Xiaomi also offers a Cache Hit Price of $0.003 per 1M tokens.
For RAG applications relying heavily on prompt caching (e.g., feeding the same massive codebase into the 1M context window repeatedly), Xiaomi’s API provides strong economic value. The platform also offers OpenAI- and Anthropic-compatible APIs with comprehensive documentation and low-latency inference.
Note: Xiaomi is deprecating the V2 series by June 30, 2026. Developers using MiMo-V2-Flash or related models should migrate to the V2.5 series before the legacy names expire.
The Verdict: A solid middle-ground provider offering competitive speeds and reliable uptime for multi-API architectures.
Novita serves as a strong alternative routing option. Clocking in at roughly 86.0 tokens per second, its throughput is nearly identical to Xiaomi’s first-party offering. While it does not reach the speeds of DeepInfra, Novita maintains consistent latency metrics and offers competitive blended pricing tiers. It is a reasonable fallback provider in multi-API routing setups to help maintain high availability for agentic workflows.
The Verdict: A niche provider suited for offline bulk processing where cost is more important than immediate speed.
While Parasail trails behind DeepInfra and Xiaomi in raw output speed (averaging in the high 60s for t/s) and has slightly higher latency, it competes on price. For offline processing tasks, bulk data extraction, or asynchronous RAG pipelines where end-to-end response time is not mission-critical, Parasail’s low blended token pricing makes it a viable secondary option.
What is the parameter size of MiMo-V2.5?
MiMo-V2.5 is a sparse Mixture of Experts (MoE) model featuring 310 billion total parameters. It is optimized for inference compute, utilizing only 15 billion active parameters per token during generation.
What is the maximum context window for MiMo-V2.5?
The model natively supports a 1,000,000 (1M) token context window. This is large enough to process approximately 1,500 standard A4 pages of text, or entire codebases, in a single prompt.
Is MiMo-V2.5 a multimodal model?
Yes. MiMo-V2.5 is a native omnimodal model supporting text, image, video, and audio inputs. It can analyze, describe, and reason over visual and audio data alongside text, though its final output modality is strictly text.
How much training data was MiMo-V2.5 trained on?
MiMo-V2.5 was trained on approximately 48 trillion tokens, including diverse text data, multimodal pre-training data, and supervised fine-tuning with agentic data.
Why is DeepInfra recommended over the first-party Xiaomi API?
While Xiaomi offers superior pricing and prompt caching rates, DeepInfra provides stronger technical performance. DeepInfra yields significantly higher output tokens per second (130+ t/s) and lower Time to First Token (TTFT), making it the optimal choice for latency-sensitive applications. For cost-sensitive batch workloads, Xiaomi remains the better choice.
Is MiMo-V2.5 open source?
Yes, MiMo-V2.5 is an open-weights model released by Xiaomi under the permissive MIT license, meaning it is fully available for unrestricted commercial use, modification, and self-hosting. Weights, tokenizer, and the full model card are available on Hugging Face.
What hardware do I need to self-host MiMo-V2.5?
Self-hosting MiMo-V2.5 requires significant GPU resources due to its 310B parameter MoE architecture. Consumer GPUs do not have enough VRAM — enterprise-grade hardware such as a well-equipped workstation or cloud GPU instances is needed. Refer to the SGLang MiMo-V2.5 Cookbook for the latest deployment guide.
How does MiMo-V2.5 compare to MiMo-V2.5-Pro?
MiMo-V2.5-Pro is Xiaomi’s larger flagship variant with 1.02 trillion total parameters and 42 billion active parameters. It offers higher intelligence scores and is designed for more demanding long-horizon tasks, but at a higher price point. MiMo-V2.5 offers a strong balance of capability and cost for most production use cases.
Does DeepInfra support the full 1M context window for MiMo-V2.5?
Context window support may vary by provider. DeepInfra’s deployment may have different context limits than Xiaomi’s native API. Always confirm the exact context limits with your chosen provider before committing to production workloads.
How to Use OpenClaw with DeepInfra: Setup & Workflow Guide<p>When you first learn how to use OpenClaw, the onboarding flow asks for an API key and points you toward Anthropic or OpenAI. Reasonable starting point. For production agents running dozens of tasks a day, it’s an expensive one. OpenClaw works with any OpenAI-compatible API, so you can swap the default model for an open-weight […]</p>
How Mixture of Experts Models Changed LLM Economics<p>Every open-weight model that has closed the gap with GPT-5.5 and Claude Opus 4.7 this year has one thing in common. DeepSeek V4-Pro: 1.6 trillion parameters, 49 billion active per token. Kimi K2.6: 1 trillion parameters, 32 billion active. GLM-5.1: 744 billion parameters, 40 billion active. MiniMax M2.7: large total parameter count, 10 billion active […]</p>
Best API Providers for NVIDIA Nemotron 3 Super 120B<p>Nemotron 3 Super 120B is available across a growing number of hosted APIs and deployment platforms. At 120B total parameters with 12B active per inference pass, the right provider matters: latency, throughput, and cost vary significantly depending on where you run it. This guide covers the top options by use case — from fully managed […]</p>
© 2026 DeepInfra. All rights reserved.