XiaomiMiMo/MiMo-V2.5-Pro
Pricing: $1.00 / 1M input tokens · $3.00 / 1M output tokens · $0.20 / 1M cached tokens
License: MIT
MiMo-V2.5-Pro is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters. It uses the hybrid attention architecture and 3-layer Multi-Token Prediction (MTP) introduced in [MiMo-V2-Flash](https://github.com/XiaomiMiMo/MiMo-V2-Flash), and supports a context length of up to 1M tokens.
MiMo-V2.5-Pro is our most capable model to date, designed for the most demanding agentic, complex software engineering, and long-horizon tasks. It sustains complex trajectories spanning thousands of tool calls with strong instruction following and coherence over a 1M-token context window. The released model variants are listed below:
| Model | Total Params | Active Params | Context Length | Precision | Download |
|---|---|---|---|---|---|
| MiMo-V2.5-Pro | 1.02T | 42B | 1M | FP8 (E4M3) Mixed | 🤗 HuggingFace 🤖 ModelScope |
| MiMo-V2.5-Pro-Base | 1.02T | 42B | 256K | FP8 (E4M3) Mixed | 🤗 HuggingFace 🤖 ModelScope |
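If the released weights follow the standard Hugging Face text-generation interface, loading might look like the sketch below. The repo id and the need for `trust_remote_code` are assumptions, not confirmed details of the release.

```python
# Minimal loading sketch via transformers (repo id and custom-code requirement assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XiaomiMiMo/MiMo-V2.5-Pro"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # keep the released mixed-precision weights where supported
    device_map="auto",       # shard the 1.02T-parameter checkpoint across available GPUs
    trust_remote_code=True,  # custom hybrid-attention/MTP modeling code, if any
)

messages = [{"role": "user", "content": "Summarize the MiMo-V2.5-Pro architecture."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```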
| Category | Benchmark | Setting | MiMo-V2.5-Pro Base | MiMo-V2.5 Base | DeepSeek-V4-Pro Base | DeepSeek-V4-Flash Base | Kimi-K2 Base |
|---|---|---|---|---|---|---|---|
| Params | #Activated / #Total | - | 42B / 1.02T | 15B / 310B | 49B / 1.6T | 13B / 284B | 32B / 1.04T |
| General | BBH | 3-shot | 88.4 | 87.2 | 87.5 | 86.9 | 88.7 |
| | MMLU | 5-shot | 89.4 | 86.3 | 90.1 | 88.7 | 87.8 |
| | MMLU-Redux | 5-shot | 92.8 | 89.8 | 90.8 | 89.4 | 90.2 |
| | MMLU-Pro | 5-shot | 68.5 | 65.8 | 73.5 | 68.3 | 69.2 |
| | DROP | 3-shot | 86.3 | 83.7 | 88.7 | 88.6 | 83.6 |
| | ARC-Challenge | 25-shot | 97.2 | 96.5 | - | - | 96.2 |
| | HellaSwag | 10-shot | 89.8 | 88.6 | 88.0 | 85.7 | 94.6 |
| | WinoGrande | 5-shot | 85.6 | 84.7 | 81.5 | 79.5 | 85.3 |
| | TriviaQA | 5-shot | 81.3 | 80.7 | 85.6 | 82.8 | 85.1 |
| | GPQA-Diamond | 5-shot | 66.7 | 58.1 | - | - | 48.1 |
| Math | GSM8K | 8-shot | 99.6 | 83.3 | 92.6 | 90.8 | 92.1 |
| | MATH | 4-shot | 86.2 | 67.7 | 64.5 | 57.4 | 70.2 |
| | AIME 24&25 | 2-shot | 37.3 | 36.9 | - | - | 31.6 |
| Code | HumanEval+ | 1-shot | 75.6 | 71.3 | - | - | 84.8 |
| | MBPP+ | 3-shot | 74.1 | 70.9 | - | - | 73.8 |
| | LiveCodeBench v6 | 1-shot | 39.6 | 35.5 | - | - | 26.3 |
| | SWE-Bench (AgentLess) | 3-shot | 35.7 | 30.8 | - | - | 28.2 |
| Chinese | C-Eval | 5-shot | 91.5 | 88.6 | 93.1 | 92.1 | 92.5 |
| | CMMLU | 5-shot | 90.2 | 88.2 | 90.8 | 90.4 | 90.9 |
| Multilingual | GlobalMMLU | 5-shot | 83.6 | 77.4 | - | - | 80.7 |
GraphWalks is a long-context benchmark from OpenAI that fills the prompt with a directed graph of hex-hash nodes and asks the model either to run a breadth-first search (returning the nodes exactly at depth N from a given root) or to list a node's parents. We evaluate across the full 32k–1M input-token span and apply the same evaluation fixes described by Anthropic.
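To make the two subtasks concrete, the sketch below computes reference answers for a toy graph; the benchmark's actual prompt format and scoring are not reproduced here.

```python
# Sketch: gold answers for GraphWalks-style BFS and Parents queries on a toy directed graph.
from collections import defaultdict

edges = [("a1f3", "9bc0"), ("a1f3", "77de"), ("9bc0", "5e21"), ("77de", "5e21")]  # toy hex-hash nodes

children = defaultdict(list)
parents = defaultdict(list)
for src, dst in edges:
    children[src].append(dst)
    parents[dst].append(src)

def nodes_at_depth(root: str, depth: int) -> set[str]:
    """Breadth-first search: return the nodes exactly `depth` hops from `root`."""
    frontier, seen = {root}, {root}
    for _ in range(depth):
        frontier = {c for n in frontier for c in children[n] if c not in seen}
        seen |= frontier
    return frontier

print(nodes_at_depth("a1f3", 2))  # {'5e21'}
print(set(parents["5e21"]))       # {'9bc0', '77de'}
```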
MiMo-V2.5-Pro delivers a major leap in long-context reasoning. Past 128k, V2 Pro degrades rapidly and collapses to 0.00 at 1M on both subtasks, while V2.5 Pro still scores 0.56 BFS / 0.92 Parents at 512k and 0.37 / 0.62 at 1M.
MiMo-V2.5-Pro addresses the quadratic cost of attention over long contexts by interleaving local Sliding Window Attention (SWA) and Global Attention (GA) layers. Unlike a traditional speculative-decoding draft model, the MTP module is natively integrated into both training and inference. The full configuration is listed in the table below, and a sketch of the implied attention-layer layout follows it.
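Because the MTP layers are part of the model rather than a separate draft network, decoding can verify their drafted tokens against the main model in a single pass. The sketch below shows only the greedy accept-or-reject step, with the draft tokens and verification logits treated as given; drafting one token per MTP layer (up to 3) is an assumption.

```python
# Sketch: greedy verification of MTP-drafted tokens (draft and verification logits assumed given).
import torch

def accept_drafted_tokens(draft_tokens: torch.Tensor, verify_logits: torch.Tensor) -> torch.Tensor:
    """draft_tokens: (k,) tokens proposed by the MTP layers (k <= 3 assumed).
    verify_logits: (k, vocab) logits from the main model at the same positions.
    Returns the longest prefix of drafts matching the main model's greedy choice,
    so accepted tokens cost one verification pass instead of k sequential decode steps."""
    greedy = verify_logits.argmax(dim=-1)             # what the main model would have produced
    matches = (draft_tokens == greedy).long()
    accepted = int(matches.cumprod(dim=0).sum())      # stop at the first mismatch
    return draft_tokens[:accepted]

draft = torch.tensor([11, 42, 7])
logits = torch.zeros(3, 100)
logits[0, 11] = 5.0; logits[1, 42] = 5.0; logits[2, 9] = 5.0  # main model disagrees on the 3rd token
print(accept_drafted_tokens(draft, logits))           # tensor([11, 42])
```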
| Component | MiMo-V2.5-Pro | MiMo-V2.5 |
|---|---|---|
| Total Parameters | 1.02T | 310B |
| Activated Parameters | 42B | 15B |
| Hidden Size | 6144 | 4096 |
| Num Layers | 70 (1 dense + 69 MoE) | 48 (1 dense + 47 MoE) |
| Full Attention Layers | 10 | 9 |
| SWA Layers | 60 | 39 |
| Num Attention Heads | 128 | 64 |
| Num KV Heads | 8 (GQA) | 8 (GA) / 4 (SWA) |
| Head Dim (QK / V) | 192 / 128 | 192 / 128 |
| Routed Experts | 384 | 256 |
| Experts per Token | 8 | 8 |
| MoE Intermediate Size | 2048 | 2048 |
| Dense Intermediate Size | 16384 (layer 0 only) | 16384 (layer 0 only) |
| SWA Window Size | 128 | 128 |
| Max Context Length | 1M | 1M |
| MTP Layers | 3 | 3 |
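The sketch below shows one way the numbers in the table could translate into a layer schedule and a sliding-window mask. Only the counts (10 full-attention and 60 SWA layers, window 128) come from the table; the exact interleaving pattern is an assumption for illustration.

```python
# Sketch: hybrid attention layout implied by the config table (interleaving pattern assumed).
import torch

NUM_LAYERS, FULL_ATTN_LAYERS, SWA_WINDOW = 70, 10, 128

def layer_schedule() -> list[str]:
    """Assume one global-attention layer after every 6 SWA layers: 70 layers = 10 GA + 60 SWA."""
    return ["GA" if (i + 1) % 7 == 0 else "SWA" for i in range(NUM_LAYERS)]

def sliding_window_mask(seq_len: int, window: int = SWA_WINDOW) -> torch.Tensor:
    """Causal mask where each query attends only to the previous `window` keys (True = keep)."""
    q = torch.arange(seq_len).unsqueeze(1)
    k = torch.arange(seq_len).unsqueeze(0)
    return (k <= q) & (k > q - window)

sched = layer_schedule()
print(sched.count("GA"), sched.count("SWA"))        # 10 60
print(sliding_window_mask(6, window=3).int())       # small example of the banded causal mask
```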
For post-training, MiMo-V2.5-Pro adopts the three-stage post-training paradigm introduced in MiMo-V2-Flash. The paradigm begins with Supervised Fine-Tuning (SFT), which builds strong foundational instruction-following skills from curated data pairs. Next, in the Domain-Specialized Training stage, diverse teacher models, covering domains from math and safety to complex agentic tool use, are each optimized with domain-specific RL rewards. Finally, the process culminates in Multi-Teacher On-Policy Distillation (MOPD): through on-policy RL, a single student model learns from its own outputs while receiving precise token-level guidance from the expert teachers, integrating their broad capabilities into one model.
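As a rough illustration of the MOPD idea, the sketch below scores a student-generated rollout with a domain-specific teacher at the token level. This is a simplified reconstruction under stated assumptions (the teacher-selection keys and the reverse-KL objective are illustrative choices), not the actual training recipe.

```python
# Sketch of multi-teacher on-policy distillation (MOPD): the student samples its own
# trajectory, then a domain-specific teacher provides token-level guidance on it.
import torch
import torch.nn.functional as F

def mopd_loss(student, teachers: dict, domain: str, prompt_ids: torch.Tensor,
              max_new_tokens: int = 64) -> torch.Tensor:
    teacher = teachers[domain]  # e.g. "math", "safety", "agentic_tool_use" (hypothetical keys)

    # On-policy: the trajectory comes from the student itself, not from a fixed dataset.
    rollout = student.generate(prompt_ids, max_new_tokens=max_new_tokens, do_sample=True)
    start = prompt_ids.shape[-1]

    # Logits at positions start-1 .. T-2 predict the generated tokens start .. T-1.
    student_logits = student(rollout).logits[:, start - 1:-1]
    with torch.no_grad():
        teacher_logits = teacher(rollout).logits[:, start - 1:-1]

    # Token-level guidance: one plausible objective is reverse KL(student || teacher)
    # evaluated on the student's own generated tokens.
    return F.kl_div(
        F.log_softmax(teacher_logits, dim=-1),   # input distribution (log-probs)
        F.log_softmax(student_logits, dim=-1),   # target distribution (log-probs)
        log_target=True,
        reduction="batchmean",
    )
```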