
Xiaomi’s MiMo-V2.5 collapses what used to require two separate models — frontier agentic capability and native multimodal understanding — into one. Previously, MiMo-V2-Pro handled agentic and coding tasks while MiMo-V2-Omni covered visual and audio inputs; MiMo-V2.5 replaces both. It handles text, images, video, and audio natively, extends context to 1 million tokens, and scores 71.8 on SWE-Bench Pro, surpassing its agentic-specialized predecessor.
The efficiency story is just as notable as the capability one: despite 310B total parameters, only 15B are active per forward pass thanks to a sparse Mixture-of-Experts architecture — a design that makes running a model of this caliber economically practical. The 1M-token context window is backed by a hybrid sliding-window and global attention mechanism that reduces KV-cache storage by roughly 6×, so long-context performance doesn’t come at the cost of throughput. Xiaomi released the weights under the MIT license, meaning there are no restrictions on commercial use. And now, it’s available on DeepInfra.
MiMo-V2.5 is Xiaomi’s first model to unify agentic and multimodal capabilities under a single architecture. Previously, these capabilities were split across two separate models — MiMo-V2-Pro handled agentic tasks, while MiMo-V2-Omni handled multimodal understanding. V2.5 consolidates both, adds native video and audio input, and extends the context window to 1 million tokens, all while improving on V2-Pro’s agentic benchmark scores.
The model is a sparse MoE with 310B total parameters, but only 15B are active per forward pass (256 routed experts, 8 active per token). The language backbone uses a hybrid attention design that interleaves Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio with a 128-token window — this reduces KV-cache storage by roughly 6× compared to full attention, which matters a lot at 1M context. A 329M-parameter Multi-Token Prediction (MTP) module handles speculative decoding and also improves RL training efficiency.
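The "roughly 6×" figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not Xiaomi's accounting; it assumes every layer stores the same number of KV bytes per cached token, that each SWA layer keeps at most 128 tokens, and that each GA layer keeps the whole context. The function name `kv_cache_reduction` is made up for illustration.

```python
def kv_cache_reduction(context_len: int, window: int = 128, swa_per_ga: int = 5) -> float:
    """Ratio of full-attention KV entries to hybrid SWA/GA KV entries.

    Assumption: all layers have identical per-token KV size, so only the
    number of cached tokens per layer matters.
    """
    layers_per_group = swa_per_ga + 1  # 5 SWA layers + 1 GA layer
    # Full attention: every layer in the group caches every token.
    full = layers_per_group * context_len
    # Hybrid: SWA layers cache at most `window` tokens, the GA layer caches all.
    hybrid = swa_per_ga * min(window, context_len) + context_len
    return full / hybrid


for n in (32_768, 262_144, 1_000_000):
    print(f"{n:>9,} tokens: {kv_cache_reduction(n):.2f}x fewer cached KV entries")

# Share of parameters used per forward pass, from the 15B / 310B figures above.
print(f"active parameter share: {15 / 310:.1%}")
```

Under these assumptions the ratio approaches 6 as the context grows, because the five windowed layers contribute a fixed 640 cached tokens while the single global layer grows with the sequence.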
Both the visual and audio encoders were pretrained in-house by Xiaomi.
The model was trained on ~48T tokens using FP8 mixed precision across five stages: text pre-training, projector warmup, multimodal pre-training, SFT with agentic post-training (with progressive context extension: 32K → 256K → 1M), and a final RL stage using Multi-Teacher On-Policy Distillation (MOPD). The MOPD step distills from multiple teacher models simultaneously during online RL rollouts, targeting perception, reasoning, and agentic capabilities together.
On agentic and coding benchmarks, MiMo-V2.5 surpasses its predecessor MiMo-V2-Pro across the board:
| Benchmark | MiMo-V2.5 | MiMo-V2-Pro | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro | 71.8 | 71.5 | 77.1 | 67.8 |
| MiMo Coding Bench | 62.3 | 57.8 | 70.8 | 57.8 |
| Terminal-Bench 2.0 | 56.1 | 55.0 | 57.3 | 54.2 |
On multimodal benchmarks, it also consistently beats MiMo-V2-Omni, with the most notable gap showing up in video understanding — Claw-Eval Multimodal jumps from 15.8 to 23.8, and VideoHolmes from 59.5 to 64.0:
| Benchmark | MiMo-V2.5 | MiMo-V2-Omni | Gemini 3 Pro |
|---|---|---|---|
| MMMU-Pro | 88.5 | 83.3 | 86.4 |
| Video-MME | 87.7 | 85.3 | 88.4 |
| DailyOmni | 83.5 | 80.5 | 84.2 |
| VideoHolmes | 64.0 | 59.5 | 64.2 |
The model is fully open-sourced under the MIT license, with weights available on Hugging Face in two variants: a base model (256K context) and the full instruct model (1M context). If you want to explore other models in the same family, the multimodal models page has the full listing.
MiMo-V2.5 is available as a public endpoint on DeepInfra, with private endpoint deployment also supported. Pricing is straightforward: Standard tier runs $0.40/1M input tokens and $2.00/1M output tokens, with cached input at $0.08/1M tokens. Priority tier — which reduces queuing time under load — scales to $0.60/1M input and $3.00/1M output ($0.12/1M cached). The platform-listed context window is 262,144 tokens. The model supports JSON mode, function calling, and multimodal inputs (text, image, video, and audio), and is available in English and Chinese.
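To turn those rates into a per-request estimate, here is a minimal sketch. It assumes cached tokens are a subset of the input tokens and are billed at the cached rate instead of the regular input rate; that billing detail is an assumption, so check your invoice or the pricing page. `Tier`, `TIERS`, and `estimate_cost` are illustrative names, not part of any DeepInfra SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    input_per_m: float   # USD per 1M uncached input tokens
    cached_per_m: float  # USD per 1M cached input tokens
    output_per_m: float  # USD per 1M output tokens


# Rates copied from the pricing paragraph above.
TIERS = {
    "standard": Tier(input_per_m=0.40, cached_per_m=0.08, output_per_m=2.00),
    "priority": Tier(input_per_m=0.60, cached_per_m=0.12, output_per_m=3.00),
}


def estimate_cost(tier: str, input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request (assumes cached tokens are part of the input)."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    rates = TIERS[tier]
    uncached = input_tokens - cached_tokens
    total = (
        uncached * rates.input_per_m
        + cached_tokens * rates.cached_per_m
        + output_tokens * rates.output_per_m
    )
    return total / 1_000_000


for name in TIERS:
    cost = estimate_cost(name, input_tokens=200_000, output_tokens=4_000, cached_tokens=150_000)
    print(f"{name}: ${cost:.4f}")
```

For a 200,000-token prompt with 150,000 tokens served from cache and 4,000 output tokens, this comes to about $0.04 on Standard and $0.06 on Priority.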
DeepInfra exposes an OpenAI-compatible API with usage-based billing — swap in the base URL and your token, and it drops into any existing OpenAI client without further changes. The platform operates under a zero-data-retention policy and is SOC 2 and ISO 27001 certified.
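Because the API follows the OpenAI chat format, an image request is plausibly just a user message whose `content` is a list of parts. The sketch below only builds and prints the payload; it sends nothing. The `image_url` part shape is the OpenAI convention, and whether this endpoint accepts base64 data URLs (and how video or audio should be passed) is an assumption to confirm against the API reference. `image_part` and `build_payload` are hypothetical helpers.

```python
import base64
import json


def image_part(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes as an OpenAI-style image_url content part (data URL)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def build_payload(prompt: str, image_bytes: bytes) -> dict:
    """Build a chat-completions request body with one text part and one image part."""
    return {
        "model": "XiaomiMiMo/MiMo-V2.5",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    image_part(image_bytes),
                ],
            }
        ],
    }


# Placeholder bytes stand in for a real PNG file read from disk.
payload = build_payload("Describe what this screenshot shows.", b"placeholder-image-bytes")
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent as the JSON body of a chat-completions request, or its `messages` list can be passed to an OpenAI client pointed at the DeepInfra base URL.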
The MiMo-V2.5 API reference covers the full parameter set, but here’s everything you need for a first call:
```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [
      {
        "role": "user",
        "content": "Walk me through how you would approach debugging a memory leak in a long-running Node.js service."
      }
    ]
  }'
```
```python
import os

from openai import OpenAI

# Read the token from the environment; a literal "$DEEPINFRA_TOKEN" string
# is not expanded by Python and would be sent as-is.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2.5",
    messages=[
        {
            "role": "user",
            "content": "Walk me through how you would approach debugging a memory leak in a long-running Node.js service.",
        }
    ],
)

print(response.choices[0].message.content)
```
```javascript
import OpenAI from "openai";

// Read the token from the environment; a literal "$DEEPINFRA_TOKEN" string
// is not expanded by JavaScript and would be sent as-is.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});

const response = await openai.chat.completions.create({
  model: "XiaomiMiMo/MiMo-V2.5",
  messages: [
    {
      role: "user",
      content: "Walk me through how you would approach debugging a memory leak in a long-running Node.js service.",
    },
  ],
});

console.log(response.choices[0].message.content);
```

Beyond the base omnimodal model, the MiMo-V2.5 family includes a few other variants worth knowing about.
MiMo-V2.5-Pro is a dedicated MoE language model with 1.02T total parameters and 42B active parameters — larger and more capable on pure text and reasoning tasks than the base V2.5 model, and a reasonable choice when you don’t need multimodal inputs but want maximum headroom on agentic workloads. The MiMo-V2.5-Pro API reference has the full parameter documentation.
On the speech side, the family includes two TTS models. MiMo-V2.5-tts converts text to natural speech with configurable output parameters, and MiMo-V2.5-tts-voiceclone extends that with voice cloning — useful if you need consistent speaker identity across generated audio. Both are available as endpoints on DeepInfra, and you can test them directly from the MiMo-V2.5-tts-voiceclone demo page before wiring them into a pipeline.
MiMo-V2.5 is a concrete step toward consolidating the fragmented model stack that agentic development has required until now — one model handling code, reasoning, vision, audio, and long context, with an architecture efficient enough to make it practical at scale. The MIT license removes friction around commercial deployment, and the benchmark numbers suggest the capability consolidation hasn’t come at the cost of quality on either the agentic or multimodal side.
For teams building agents that need to perceive, reason, and act across diverse input types — automated code review pipelines, multi-step document analysis, tools that ingest screen recordings alongside text — this is a more coherent architecture than stitching together purpose-built models. Head to the MiMo-V2.5 model page to get your API key and start building.
© 2026 DeepInfra. All rights reserved.