XiaomiMiMo/MiMo-V2.5

Pricing: $0.40 in / $2.00 out / $0.08 cached, per 1M tokens


Public deployment · 262,144-token context · JSON mode · function calling

Model Information

license: mit
language: en, zh
tags: multimodal, vision-language, audio, agent, video-understanding, long-context



Xiaomi-MiMo


Community
WeChat Group  |  Discord  |  Telegram  |  Reddit

⚠️ Important: Config Update Notice

The config.json and tokenizer_config.json files in this repository have been updated since the initial release. If you downloaded MiMo-V2.5 before this commit (4da2748), please re-pull or manually update these two files to ensure correct model behavior. Using the outdated config may lead to degraded model performance. We apologize for any inconvenience.

Quick fix:

```bash
hf download XiaomiMiMo/MiMo-V2.5 config.json tokenizer_config.json --local-dir ./MiMo-V2.5
```
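If you prefer to do the same thing from Python, the two updated files can be re-fetched with `huggingface_hub`. A minimal sketch, assuming `./MiMo-V2.5` is the same local checkout targeted by the command above:

```python
from huggingface_hub import hf_hub_download

# Re-download only the two updated config files into the local checkout.
# force_download replaces any stale cached copies.
for filename in ("config.json", "tokenizer_config.json"):
    hf_hub_download(
        repo_id="XiaomiMiMo/MiMo-V2.5",
        filename=filename,
        local_dir="./MiMo-V2.5",
        force_download=True,
    )
```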

MiMo-V2.5

1. Introduction

MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture. Built upon the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders, it delivers robust performance across multimodal perception, long-context reasoning, and agentic workflows. Key features include:

  • Hybrid Attention Architecture: Inherits the hybrid design from MiMo-V2-Flash, interleaving Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio with a 128-token sliding window. This reduces KV-cache storage by nearly 6× while maintaining long-context performance via a learnable attention sink bias (see the sketch after this list).

  • Native Omnimodal Encoders: Equipped with a 729M-param Vision Transformer (ViT) featuring hybrid window attention and a dedicated audio encoder initialized from the weights of MiMo-Audio, enabling high-quality image, video, and audio understanding.

  • Multi-Token Prediction (MTP): Three lightweight MTP modules with dense FFNs accelerate inference via speculative decoding and improve RL training efficiency.

  • Efficient Pre-Training: Trained on a total of ~48T tokens using FP8 mixed precision. The context window supports up to 1M tokens.

  • Agentic Capabilities: Post-training incorporates SFT, large-scale agentic RL, and Multi-Teacher On-Policy Distillation (MOPD), achieving strong performance on agentic tasks and multimodal understanding benchmarks.
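
To make the KV-cache claim concrete, here is a rough back-of-envelope sketch of how interleaving SWA and GA layers with a 128-token window shrinks the cache at long context. The layer counts and head dimensions come from the backbone table in Section 4; everything else (per-sequence accounting, ignoring dtype, paging, and MTP) is a simplifying assumption, so the result only approximates the model card's "nearly 6×" figure:

```python
# Rough KV-cache size estimate for the hybrid attention design (simplified accounting).
# From the Section 4 table: 39 SWA layers, 9 GA layers, 4 KV heads (SWA) / 8 KV heads (GA),
# K head dim 192, V head dim 128, sliding window 128.

def kv_cache_elems(context_len: int, swa_layers: int, ga_layers: int,
                   swa_kv_heads: int, ga_kv_heads: int,
                   k_dim: int, v_dim: int, window: int) -> int:
    """Total cached K/V elements per sequence (ignores dtype, paging, MTP)."""
    per_token = k_dim + v_dim
    swa = swa_layers * swa_kv_heads * min(context_len, window) * per_token
    ga = ga_layers * ga_kv_heads * context_len * per_token
    return swa + ga

ctx = 1_000_000
hybrid = kv_cache_elems(ctx, swa_layers=39, ga_layers=9,
                        swa_kv_heads=4, ga_kv_heads=8, k_dim=192, v_dim=128, window=128)
# Baseline: all 48 layers use global attention with 8 KV heads.
full = kv_cache_elems(ctx, swa_layers=0, ga_layers=48,
                      swa_kv_heads=4, ga_kv_heads=8, k_dim=192, v_dim=128, window=128)
print(f"hybrid/full cache ratio: {hybrid / full:.3f}")
# ≈ 0.19, i.e. roughly 5x smaller under these assumptions
# (the model card's "nearly 6x" presumably uses its own accounting).
```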

MiMo-V2.5 Architecture

Model Summary

  • Architecture: Sparse MoE (Mixture of Experts), 310B total / 15B activated parameters
  • Context Length: Up to 1M tokens
  • Modalities: Text, Image, Video, Audio
  • Vision Encoder: 729M-param ViT (28 layers: 24 SWA + 4 Full)
  • Audio Encoder: 261M-param Audio Transformer (24 layers: 12 SWA + 12 Full)
  • Multi-Token Prediction (MTP): 329M parameters, 3 layers
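
As a sanity check on the 310B-total / 15B-activated split, a crude parameter estimate from the backbone table in Section 4 is sketched below. It assumes a SwiGLU-style three-matrix FFN per routed expert and ignores attention, embeddings, any shared experts, and the MTP modules, so the totals are only meant to land in the right ballpark:

```python
# Back-of-envelope MoE parameter count (simplifying assumptions, not official accounting).
hidden = 4096          # hidden size
moe_inter = 2048       # per-expert FFN intermediate size
moe_layers = 47        # 48 layers = 1 dense + 47 MoE
routed_experts = 256
experts_per_token = 8

# SwiGLU-style FFN: gate, up, and down projections -> 3 * hidden * intermediate params per expert.
params_per_expert = 3 * hidden * moe_inter

total_expert_params = moe_layers * routed_experts * params_per_expert
active_expert_params = moe_layers * experts_per_token * params_per_expert

print(f"routed-expert params, total : {total_expert_params / 1e9:.0f}B")   # ~303B of the 310B total
print(f"routed-expert params, active: {active_expert_params / 1e9:.1f}B")  # ~9.5B of the 15B activated
```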

2. Downloads

| Model | Context Length | Download |
|---|---|---|
| MiMo-V2.5-Base | 256K | 🤗 HuggingFace · 🤖 ModelScope |
| MiMo-V2.5 | 1M | 🤗 HuggingFace · 🤖 ModelScope |
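
For a full local copy of the weights (rather than the two-file fix above), the whole repository can be fetched with `huggingface_hub`. A minimal sketch, assuming the 1M-context chat model is wanted and that the base model lives under the analogous repo id:

```python
from huggingface_hub import snapshot_download

# Download the complete MiMo-V2.5 repository (configs, tokenizer, and weight shards).
# Presumably "XiaomiMiMo/MiMo-V2.5-Base" is the repo id for the 256K base model.
local_path = snapshot_download(
    repo_id="XiaomiMiMo/MiMo-V2.5",
    local_dir="./MiMo-V2.5",
)
print(f"model files available under {local_path}")
```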

3. Evaluation Results

Multimodal Benchmarks

MiMo-V2.5 Multimodal Benchmark Results

Coding & Agent Benchmarks

MiMo-V2.5 Coding and Agentic Benchmark Results

Long Context Benchmarks

MiMo-V2.5 Graphwalks

4. Model Architecture

LLM Backbone

MiMo-V2.5's core language backbone inherits from the MiMo-V2-Flash architecture, a sparse MoE model with hybrid sliding window attention.

| Component | MiMo-V2.5-Pro | MiMo-V2.5 |
|---|---|---|
| Total Parameters | 1.02T | 310B |
| Activated Parameters | 42B | 15B |
| Hidden Size | 6144 | 4096 |
| Num Layers | 70 (1 dense + 69 MoE) | 48 (1 dense + 47 MoE) |
| Full Attention Layers | 10 | 9 |
| SWA Layers | 60 | 39 |
| Num Attention Heads | 128 | 64 |
| Num KV Heads | 8 (GQA) | 8 (GA) / 4 (SWA) |
| Head Dim (QK / V) | 192 / 128 | 192 / 128 |
| Routed Experts | 384 | 256 |
| Experts per Token | 8 | 8 |
| MoE Intermediate Size | 2048 | 2048 |
| Dense Intermediate Size | 16384 (layer 0 only) | 16384 (layer 0 only) |
| SWA Window Size | 128 | 128 |
| Max Context Length | 1M | 1M |
| MTP Layers | 3 | 3 |

Vision Encoder

We train a dedicated MiMo ViT that adopts sliding-window attention to enable efficient visual encoding.

| Configuration | Value |
|---|---|
| Total Layers | 28 |
| SWA Layers | 24 |
| Full Attention Layers | 4 |
| Window-Attention Pattern | [-1] + [0,0,0,0,1,1,1,1,-1] × 3 |
| Attention Heads (Q / KV) | 32 / 8 |
| Head Dimensions (QK / V) | 64 / 64 |
| Sliding Window Size (L / R) | 64 / 64 |

Window pattern notation: -1 = full attention, 0 = 1-D row window, 1 = 1-D column window.
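
A small sketch of how that pattern expands into the 28-layer schedule (purely illustrative; the actual config format may differ):

```python
# Expand the ViT window-attention pattern into one entry per layer.
# -1 = full attention, 0 = 1-D row window, 1 = 1-D column window.
pattern = [-1] + [0, 0, 0, 0, 1, 1, 1, 1, -1] * 3

assert len(pattern) == 28                      # Total Layers
assert pattern.count(-1) == 4                  # Full Attention Layers
assert len(pattern) - pattern.count(-1) == 24  # SWA Layers (row + column windows)
```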

Audio Encoder

Our audio encoder is initialized from the weights of MiMo-Audio-Tokenizer and further finetuned to support high-quality audio understanding.

| Configuration | Value |
|---|---|
| Total Layers | 24 |
| SWA Layers | 12 |
| Full Attention Layers | 12 |
| Sliding Window Size | 128 |
| Attention Heads (Q / KV) | 16 / 16 |
| Head Dimensions (QK / V) | 64 / 64 |

5. Training Process

MiMo-V2.5 is trained on a total of ~48T tokens.

  1. Text Pre-training: We collect diverse text data for pre-training the LLM backbone.
  2. Projector Warmup: A short warmup stage for the multimodal projectors (the audio and visual MLP projectors).
  3. Multimodal Pre-training: Large-scale pre-training on high-quality multimodal data.
  4. SFT & Agentic Post-Training: Supervised fine-tuning with diverse agentic data. During this stage, the context window is progressively extended from 32K → 256K → 1M.
  5. RL & MOPD Training: Reinforcement learning and Multi-Teacher On-Policy Distillation (MOPD) to improve perception, reasoning, and agentic capabilities.