MiMo-V2.5 Provider Pricing and Deployment Guide

We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Published on 2026.07.01 by han

MiMo-V2.5 is worth paying attention to because it puts three things developers usually have to trade off into the same conversation: open weights, a 1 million-token model design, and pricing that can be unusually low depending on where you buy it. On Xiaomi’s first-party API, Artificial Analysis lists MiMo-V2.5 at $0.14 per 1M input tokens and $0.28 per 1M output tokens, with a $0.003 per 1M cache-hit price and a blended rate of $0.06 per 1M tokens under a 7:2:1 cache/input/output mix. That is the kind of pricing profile that makes engineers stop treating “large multimodal reasoning model” as automatically expensive.

At a basic level, MiMo-V2.5 is a Xiaomi model released on April 22, 2026, and published as open weights under the MIT license. It is a sparse Mixture of Experts model with 310 billion total parameters and 15 billion active parameters per inference. The model family is positioned as multimodal: the research consistently supports text and image input, while DeepInfra’s technical page describes it as a native omnimodal system with text, image, video, and audio support. The headline context number is 1 million tokens for the MiMo-V2.5 model itself, though provider implementations can differ — DeepInfra’s API spec lists a 262,144-token context window on its endpoint.

What makes MiMo-V2.5 interesting is not just that it is big, but that it looks unusually practical. Artificial Analysis estimates an Intelligence Index score of 40, versus a median of 25 for comparable open-weight models, and reports 87.2 tokens per second output speed on Xiaomi’s API, above the comparable-model median of 68.7 t/s. It also supports reasoning-style extended thinking, and because the weights are openly available on Hugging Face under MIT, teams can evaluate hosted access against self-hosting without changing model family.

If you are evaluating this model for production, the real question is not whether MiMo-V2.5 is “good” in the abstract. It is whether its price, context length, multimodal support, and deployment options line up with your workload. Teams optimizing raw token economics will care about Xiaomi’s first-party rates and OpenRouter’s lower listed input price; teams that need managed endpoints, private deployment, JSON mode, function calling, and a straightforward way to operationalize the model will probably care more about what DeepInfra offers than about chasing the absolute cheapest token.

MiMo-V2.5 Pricing Summary

MiMo-V2.5 spans a wide pricing range depending on provider: Xiaomi’s first-party API is listed by Artificial Analysis at $0.14 input / $0.28 output per 1M tokens with a $0.06 blended rate in a cache-heavy mix; OpenRouter lists $0.105 input / $0.28 output per 1M tokens; and DeepInfra prices it at $0.40 input / $2.00 output on standard tier. In practice, this model is best suited for developers who want an open-weight Xiaomi model with long-context and multimodal capabilities. DeepInfra is the stronger fit when deployment control and platform features matter more than minimum token price, while Xiaomi and OpenRouter are the better pure-cost benchmarks.

Best For	Provider	Why
Proprietary or managed model access	DeepInfra Private Endpoint	DeepInfra offers private endpoint deployment via its dashboard and supports JSON mode, function calling, and multimodal features.
RAG, document-heavy, or high-throughput use cases	DeepInfra Standard Endpoint	DeepInfra exposes MiMo-V2.5 on a public endpoint with cached input pricing and a 262,144-token API context window, useful for long prompts and structured production workloads.
Lowest price / cost-sensitive workloads	Xiaomi first-party API	Artificial Analysis lists the lowest fully detailed cost structure here: $0.14 input, $0.28 output, $0.003 cache-hit pricing, and a $0.06 blended rate under a cache-heavy usage mix.
Easiest onboarding / fastest time-to-first-call	OpenRouter	OpenAI-compatible API; integration can be as simple as swapping the model slug to xiaomi/mimo-v2.5.
Lowest listed input price	OpenRouter	Lists input at $0.105 per 1M tokens, lower than Xiaomi’s $0.14 and DeepInfra’s $0.40 standard-tier price.

Understanding Tokens and How You’re Charged

Token pricing is where model comparisons get deceptively messy. “Cheap per million” does not always mean “cheap for my workload,” especially once long prompts, repeated context, and output-heavy tasks get involved.

For MiMo-V2.5, there are four token buckets to think about:

Token type	What it is	Why it matters
Input tokens	The tokens you send in the request: system prompt, user prompt, tool schemas, documents, chat history, and multimodal text-side payload	This is your baseline prompt cost. Long-context RAG, agent state, and big instruction blocks push this up fast.
Output tokens	The tokens the model generates back	Reasoning-heavy or verbose tasks can make output cost dominate, especially on providers with a large output/input price gap.
Cached input tokens	Prompt tokens the provider can reuse from earlier requests instead of billing as full fresh input	This is where repeated prefixes get much cheaper. It matters a lot for chat apps, agents, and document workflows with stable context.
Context window tokens	The maximum total tokens the model can attend to in one request	Not a separate billing category, but it controls whether you can actually use the giant prompts you are planning to pay for. Provider limits matter here.

Input tokens are where long prompts hurt. If you are stuffing in large documents, retrieved chunks, or a huge system prompt, input pricing matters more than benchmark screenshots.
Output tokens are where “thinking” can get expensive. MiMo-V2.5 is a reasoning-style model. If your app encourages long answers, chain-of-thought-like behavior, or structured multi-step output, output pricing can become the real bill driver.
Cached input tokens are the escape hatch. If your app reuses the same instructions, tool definitions, or document prefix across requests, cache pricing can materially change total cost. This is why Xiaomi’s listed blended rate looks so low under a cache-heavy mix.
Context window is the part people brag about and then quietly cap in production. MiMo-V2.5 the model is associated with a 1M-token design; DeepInfra’s API spec for this endpoint lists a 262,144-token context window. If your architecture assumes the full 1M window, verify the provider implementation before you build around it.

Token Cost Tradeoffs by Provider

Different providers make MiMo-V2.5 look like a different economic proposition. The model is the same family. The bill usually is not.

Provider	Token cost advantages	Token cost disadvantages
Xiaomi first-party API	Lowest fully detailed pricing in the sources: $0.14/M input, $0.28/M output, $0.003/M cache hit. Artificial Analysis reports a $0.06/M blended rate under a 7:2:1 cache/input/output mix. Strong fit for repeated-context workloads where caching does real work.	Cheapest only if your usage pattern matches the pricing strengths. If your workload is output-heavy and not cache-friendly, the blended number stops being useful and you are back to raw input/output rates.
OpenRouter	Lowest listed input price at $0.105/M. Effective pricing after prompt caching can be 60–80% cheaper than listed provider price based on rolling 30-day averages. Good option when you want low entry cost and OpenAI-compatible access.	Pricing is routing-dependent and less explicit than Xiaomi’s full breakdown. No separate cache-hit line item in publicly listed rates, so forecasting repeated-prefix savings is less clean.
DeepInfra Standard	Clear public pricing and platform-friendly billing: $0.40/M input, $2.00/M output, $0.08/M cached input. Useful if you care more about managed deployment, API features, and predictable integration than absolute token minimums.	Much more expensive on raw tokens, especially output. The 5x output-to-input ratio punishes verbose generations, reasoning traces, and agent loops.
DeepInfra Priority	Same operational model as standard tier with higher service priority. Straightforward pricing: $0.60/M input, $3.00/M output, $0.12/M cached input.	Most expensive option in the set. The 1.5× multiplier over standard tier compounds fast on output-heavy traffic.

For cache-heavy chat or agent workloads: Xiaomi looks best on paper because its cache-hit price is explicitly tiny. OpenRouter may also work well, but the pricing mechanics are less transparent in the source material.
For output-heavy tasks: DeepInfra is the one to watch carefully. A model that writes long answers, tool arguments, JSON payloads, or reasoning-heavy responses can rack up output charges faster than expected there.
For long-context RAG: Low input price matters more than flashy model specs. Xiaomi and OpenRouter are easier to justify for large prompt payloads. DeepInfra may still make sense if the operational features save engineering time, but the token bill alone is not the favorable story.
For repeated prompts with stable prefixes: Cached-input pricing can matter more than nominal input price. This is where teams often mis-estimate cost by benchmarking one isolated request instead of actual session behavior.
For anyone planning around the 1M-token headline: Confirm the provider’s exposed context limit first. A cheap per-token rate does not help if your provider endpoint caps the context below your design assumptions.

DeepInfra: the Power User’s Choice for MiMo-V2.5

If you want MiMo-V2.5 with more operational control, DeepInfra is the power-user option. It runs on bare-metal infrastructure, which matters because cutting out virtualization overhead can help with more predictable performance and better cost efficiency at scale. DeepInfra is also typically 50–80% cheaper than major cloud competitors, which is exactly why it tends to appeal to developers, high-volume API users, and cost-conscious teams that still want managed deployment instead of rolling everything themselves. For teams that care about throughput, platform features, and production readiness — not just the lowest sticker price — it is an easy provider to shortlist. The broader multimodal model catalog is worth scanning to see how MiMo-V2.5 fits alongside related options.

Model Name	Best Use Case	Context Window	Input ($/1M)	Output ($/1M)
MiMo-V2.5 (Standard)	Long-context multimodal production workloads on a public endpoint	262,144 tokens	$0.40	$2.00
MiMo-V2.5 (Priority)	Higher-priority traffic where you want the same model with faster service tiering	262,144 tokens	$0.60	$3.00

DeepInfra lists MiMo-V2.5 at $0.40 per 1M input tokens and $2.00 per 1M output tokens on standard tier. That is a much stronger cost story than GPT-4o-class pricing for teams that need to run large volumes of requests, especially when paired with DeepInfra’s managed deployment options. If your workload is big enough, this is the kind of pricing gap that can materially change what is feasible in production.

Real-World Cost Scenarios for Developers

Below are practical scenarios where DeepInfra makes sense for MiMo-V2.5 — not because it is the absolute cheapest token source, but because it pairs managed deployment, multimodal support, JSON mode, function calling, and private endpoint options with still-reasonable model economics.

Scenario 1: Structured support copilot with long prompts

An internal support copilot that reads policy docs, ticket history, and tool schemas on every call, then returns structured JSON for downstream automation. This is the kind of workload where DeepInfra’s managed endpoint and JSON mode are more valuable than chasing the lowest raw token rate.

Metric	Value
Volume	5,000 requests/month
Model	MiMo-V2.5
Provider	DeepInfra Standard
Input Tokens	250M
Output Tokens	25M
Monthly Cost	$150.00

Cost breakdown:

Input: 250M × $0.40/1M = $100.00
Output: 25M × $2.00/1M = $50.00
Total = $150.00/month

DeepInfra Priority would cost $225.00/month — $75.00 more.

Why DeepInfra fits: JSON mode helps when the output needs to land in ticketing or workflow systems cleanly; function calling helps when the copilot needs to trigger internal tools; the 262,144-token API context window is a practical fit for document-heavy prompts without self-hosting complexity.

Scenario 2: Multimodal agent for product catalog operations

An agent that reviews product text plus images, classifies issues, and generates short action summaries. This plays directly into DeepInfra’s multimodal positioning for MiMo-V2.5, while keeping deployment simple through one managed API.

Metric	Value
Volume	2,000,000 items/month
Model	MiMo-V2.5
Provider	DeepInfra Standard
Input Tokens	400M
Output Tokens	40M
Monthly Cost	$240.00

Cost breakdown:

Input: 400M × $0.40/1M = $160.00
Output: 40M × $2.00/1M = $80.00
Total = $240.00/month

DeepInfra Priority would cost $360.00/month — $120.00 more.

Why DeepInfra fits: DeepInfra describes MiMo-V2.5 as supporting text, image, video, and audio on its platform. One provider for multimodal inference is simpler than stitching together separate model endpoints.

Scenario 3: Private endpoint for compliance-sensitive document workflows

Legal intake, insurance ops, or enterprise document review where teams want MiMo-V2.5 behind a private endpoint rather than on a shared public path.

Metric	Value
Volume	20,000 document runs/month
Model	MiMo-V2.5
Provider	DeepInfra Standard (Private Endpoint)
Input Tokens	600M
Output Tokens	60M
Monthly Cost	$360.00

Cost breakdown:

Input: 600M × $0.40/1M = $240.00
Output: 60M × $2.00/1M = $120.00
Total = $360.00/month

DeepInfra Priority would cost $540.00/month — $180.00 more.

Why DeepInfra fits: Private endpoint availability is the real differentiator here. You keep the MIT-licensed MiMo-V2.5 model family while moving toward a more controlled production setup, with a clear path to compare managed hosting vs. self-hosting later without changing models.

Scenario 4: High-volume agent backend with repeated prompt prefixes

A production agent stack with stable system prompts, tool definitions, and reused context across many requests. DeepInfra is not as cheap as Xiaomi on cache economics, but it still offers a meaningful cached-input discount versus fresh input while keeping the API operationally straightforward.

Metric	Value
Volume	50M cached input + 100M fresh input + 20M output tokens/month
Model	MiMo-V2.5
Provider	DeepInfra Standard
Monthly Cost	$84.00

Cost breakdown:

Cached input: 50M × $0.08/1M = $4.00
Fresh input: 100M × $0.40/1M = $40.00
Output: 20M × $2.00/1M = $40.00
Total = $84.00/month

DeepInfra Priority would cost $126.00/month — $42.00 more.

Why DeepInfra fits: Cached input is still much cheaper than fresh input; function calling support matters for agent loops; good option when you want a managed endpoint and predictable integration, not just lowest possible token pricing.

Scenario 5: Prototype on public endpoint, then scale without changing model family

A small team starts with a public endpoint for fast iteration, then moves to a private deployment path later if the product sticks. DeepInfra lets you operationalize MiMo-V2.5 early without committing to self-hosting from day one.

Metric	Value
Volume	10M input + 5M output tokens/month
Model	MiMo-V2.5
Provider	DeepInfra Standard
Monthly Cost	$14.00

Cost breakdown:

Input: 10M × $0.40/1M = $4.00
Output: 5M × $2.00/1M = $10.00
Total = $14.00/month

DeepInfra Priority would cost $21.00/month — $7.00 more.

Why DeepInfra fits: Easy path from prototype to managed production. Same provider supports public access and private endpoint deployment. Strong fit for teams that value platform features and deployment flexibility more than shaving every last cent off token costs.

Conclusion

Choosing a provider for MiMo-V2.5 is not really a question of which number looks smallest in a pricing table. It is a question of which cost structure fits your actual usage pattern — how much of your context is reused, how verbose your outputs are, how much operational scaffolding you want to manage yourself, and whether platform features like JSON mode, function calling, or private endpoints are load-bearing parts of your architecture or nice-to-haves.

The practical decision criteria come down to a few things worth being honest about before you commit. If your workload is cache-heavy and you are optimizing for raw token economics, Xiaomi’s first-party pricing is the benchmark to beat. If you want OpenAI-compatible routing with low input cost and minimal setup friction, OpenRouter is a reasonable starting point. But if you need a managed endpoint with multimodal support, predictable API behavior, and a clear path from public access to private deployment — without switching model families mid-build — DeepInfra’s positioning for MiMo-V2.5 is harder to argue with. The output pricing is higher than the alternatives, so output-heavy workloads need to be sized carefully, but the platform features absorb real engineering cost that does not show up in a per-token comparison.

One thing worth flagging before you finalize your architecture: the MiMo-V2.5 model family extends beyond the base model covered in this guide. If you need a larger reasoning model, MiMo-V2.5-Pro runs 1.02T total parameters with 42B active, with full MiMo-V2.5-Pro API documentation available for integration planning. If your pipeline touches audio, DeepInfra also hosts the MiMo-V2.5-tts API for speech synthesis, along with a voice configuration interface for tuning output characteristics. The model family is broader than the base omnimodal endpoint alone.

If you are ready to test the model directly, visit the MiMo-V2.5 model page to run a real prompt against the endpoint — the fastest way to validate fit before writing a line of integration code.

How to deploy google/flan-ul2 - simple. (open source ChatGPT alternative)Flan-UL2 is probably the best open source model available right now for chatbots. In this post we will show you how to get started with it very easily. Flan-UL2 is large - 20B parameters. It is fine tuned version of the UL2 model using Flan dataset. Because this is quite a large model it is not eas...

Art That Talks Back: A Hands-On Tutorial on Talking ImagesTurn any image into a talking masterpiece with this step-by-step guide using DeepInfra’s GenAI models.

Power the Next Era of Image Generation with FLUX.2 Visual Intelligence on DeepInfraDeepInfra is excited to support FLUX.2 from day zero, bringing the newest visual intelligence model from Black Forest Labs to our platform at launch. We make it straightforward for developers, creators, and enterprises to run the model with high performance, transparent pricing, and an API designed for productivity.

View all