XiaomiMiMo/MiMo-V2.5: $0.40 in / $2.00 out / $0.08 cached, per 1M tokens

License: MIT
**Note:** The `config.json` and `tokenizer_config.json` files in this repository have been updated since the initial release. If you downloaded MiMo-V2.5 before commit 4da2748, please re-pull or manually update these two files to ensure correct model behavior; using the outdated config may lead to degraded model performance. We apologize for any inconvenience.

```shell
hf download XiaomiMiMo/MiMo-V2.5 config.json tokenizer_config.json --local-dir ./MiMo-V2.5
```
MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture. Built upon the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders, it delivers robust performance across multimodal perception, long-context reasoning, and agentic workflows. Key features include:
- **Hybrid Attention Architecture:** Inherits the hybrid design from MiMo-V2-Flash, interleaving Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio with a 128-token sliding window. This reduces KV-cache storage by nearly 6× while maintaining long-context performance via a learnable attention-sink bias.
- **Native Omnimodal Encoders:** Equipped with a 729M-parameter Vision Transformer (ViT) featuring hybrid window attention and a dedicated audio encoder initialized from the weights of MiMo-Audio, enabling high-quality image, video, and audio understanding.
- **Multi-Token Prediction (MTP):** Three lightweight MTP modules with dense FFNs accelerate inference via speculative decoding and improve RL training efficiency.
- **Efficient Pre-Training:** Trained on a total of ~48T tokens using FP8 mixed precision. The context window supports up to 1M tokens.
- **Agentic Capabilities:** Post-training incorporates SFT, large-scale agentic RL, and Multi-Teacher On-Policy Distillation (MOPD), achieving strong performance on agentic tasks and multimodal understanding benchmarks.
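To see where the roughly 6× KV-cache saving of the hybrid design comes from, here is a back-of-the-envelope sketch (illustrative arithmetic only, not the model's actual memory planner): with a 5:1 SWA:GA interleave, five of every six layers cache at most the 128-token window while one caches the full context.

```python
# Illustrative arithmetic for the hybrid-attention KV-cache saving.
# Assumes a 5:1 SWA:GA layer ratio and a 128-token sliding window,
# as stated in the model card; ignores per-layer KV-head differences.

def kv_cache_reduction(context_len: int, window: int = 128,
                       swa_per_group: int = 5, ga_per_group: int = 1) -> float:
    """Ratio of all-global KV-cache size to hybrid KV-cache size."""
    group = swa_per_group + ga_per_group
    full = group * context_len                      # every layer caches the full context
    hybrid = (swa_per_group * min(window, context_len)
              + ga_per_group * context_len)         # SWA layers cap at the window
    return full / hybrid

# As the context grows, the reduction approaches the 6:1 layer ratio:
for ctx in (1_024, 32_768, 1_000_000):
    print(ctx, round(kv_cache_reduction(ctx), 2))
```

At short contexts the saving is smaller (the window is a larger fraction of the sequence); it only approaches 6× as the context length dwarfs the 128-token window.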
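The MTP modules enable speculative decoding: cheap draft heads propose several future tokens, and the main model keeps the longest prefix it agrees with. A minimal greedy sketch of that acceptance loop (toy stand-ins for the draft and target models, not MiMo's actual implementation):

```python
from typing import Callable, List

def speculative_decode_greedy(target: Callable[[List[int]], int],
                              draft: Callable[[List[int]], int],
                              prompt: List[int], k: int, n_new: int) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens; the target
    accepts the longest matching prefix, then emits one corrected token."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_new:
        # 1) Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies: accept while its greedy choice matches the draft.
        accepted = []
        for t in proposal:
            want = target(seq + accepted)
            if want == t:
                accepted.append(t)
            else:
                accepted.append(want)   # target's correction ends this round
                break
        else:
            accepted.append(target(seq + accepted))  # bonus token when all match
        seq.extend(accepted)
    return seq[:len(prompt) + n_new]

# Toy models: the target counts up mod 10; the draft agrees except after 5.
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: 0 if s[-1] == 5 else (s[-1] + 1) % 10
print(speculative_decode_greedy(target, draft, [1], k=3, n_new=6))
# → [1, 2, 3, 4, 5, 6, 7]
```

Because accepted tokens are exactly those the target would have produced, the output matches plain greedy decoding; the speedup comes from verifying several draft tokens in one target pass.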
| Model | Context Length | Download |
|---|---|---|
| MiMo-V2.5-Base | 256K | 🤗 HuggingFace 🤖 ModelScope |
| MiMo-V2.5 | 1M | 🤗 HuggingFace 🤖 ModelScope |
MiMo-V2.5's core language backbone inherits from the MiMo-V2-Flash architecture, a sparse MoE model with hybrid sliding window attention.
| Component | MiMo-V2.5-Pro | MiMo-V2.5 |
|---|---|---|
| Total Parameters | 1.02T | 310B |
| Activated Parameters | 42B | 15B |
| Hidden Size | 6144 | 4096 |
| Num Layers | 70 (1 dense + 69 MoE) | 48 (1 dense + 47 MoE) |
| Full Attention Layers | 10 | 9 |
| SWA Layers | 60 | 39 |
| Num Attention Heads | 128 | 64 |
| Num KV Heads | 8 (GQA) | 8 (GA) / 4 (SWA) |
| Head Dim (QK / V) | 192 / 128 | 192 / 128 |
| Routed Experts | 384 | 256 |
| Experts per Token | 8 | 8 |
| MoE Intermediate Size | 2048 | 2048 |
| Dense Intermediate Size | 16384 (layer 0 only) | 16384 (layer 0 only) |
| SWA Window Size | 128 | 128 |
| Max Context Length | 1M | 1M |
| MTP Layers | 3 | 3 |
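Plugging the MiMo-V2.5 column into a quick calculation shows how the layer split translates into KV-cache memory (a sketch using the table's values; the 1-byte-per-element FP8 cache and the flat layout are assumptions, not a documented serving configuration):

```python
def kv_cache_bytes(seq_len: int,
                   full_layers: int = 9, swa_layers: int = 39,
                   kv_heads_ga: int = 8, kv_heads_swa: int = 4,
                   head_dim_k: int = 192, head_dim_v: int = 128,
                   window: int = 128, bytes_per_elem: int = 1) -> int:
    """KV-cache size for MiMo-V2.5's hybrid stack (table values as defaults)."""
    per_tok = head_dim_k + head_dim_v            # one K row + one V row per head
    ga = full_layers * kv_heads_ga * per_tok * seq_len
    swa = swa_layers * kv_heads_swa * per_tok * min(window, seq_len)
    return (ga + swa) * bytes_per_elem

# A 1M-token context with an assumed 1-byte (FP8) cache element:
print(f"{kv_cache_bytes(1_000_000) / 2**30:.1f} GiB")
```

The 39 SWA layers contribute a fixed, tiny amount regardless of context length; almost all KV memory at 1M tokens comes from the 9 full-attention layers.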
We train a dedicated MiMo ViT that adopts sliding-window attention to enable efficient visual encoding.
| Configuration | Value |
|---|---|
| Total Layers | 28 |
| SWA Layers | 24 |
| Full Attention Layers | 4 |
| Window-Attention Pattern | [-1] + [0,0,0,0,1,1,1,1,-1] × 3 |
| Attention Heads (Q / KV) | 32 / 8 |
| Head Dimensions (QK / V) | 64 / 64 |
| Sliding Window Size (L / R) | 64 / 64 |
Window pattern notation: -1 = full attention, 0 = 1-D row window, 1 = 1-D column window.
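The pattern notation can be expanded layer by layer to check it against the counts in the table above (a small sketch of the notation, not the model's configuration code):

```python
# Expand the ViT window-attention pattern:
#   -1 = full attention, 0 = 1-D row window, 1 = 1-D column window.
pattern = [-1] + [0, 0, 0, 0, 1, 1, 1, 1, -1] * 3

assert len(pattern) == 28                          # Total Layers
assert pattern.count(-1) == 4                      # Full Attention Layers
assert pattern.count(0) + pattern.count(1) == 24   # SWA Layers
print(pattern)
```

Each repeated group alternates four row-window layers, four column-window layers, and one full-attention layer, with one extra full-attention layer at the front.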
Our audio encoder is initialized from the weights of MiMo-Audio-Tokenizer and further finetuned to support high-quality audio understanding.
| Configuration | Value |
|---|---|
| Total Layers | 24 |
| SWA Layers | 12 |
| Full Attention Layers | 12 |
| Sliding Window Size | 128 |
| Attention Heads (Q / KV) | 16 / 16 |
| Head Dimensions (QK / V) | 64 / 64 |
MiMo-V2.5 is trained on a total of ~48T tokens.
© 2026 DeepInfra. All rights reserved.