XiaomiMiMo/MiMo-V2.5

Pricing: $0.40 in / $2.00 out / $0.08 cached, per 1M tokens


Public deployment · 262,144-token context · JSON mode · function calling

Model Information

license: mit
language: en, zh
tags: multimodal, vision-language, audio, agent, video-understanding, long-context



Xiaomi-MiMo


Community
WeChat Group  |  Discord  |  Telegram  |  Reddit

⚠️ Important: Config Update Notice

The config.json and tokenizer_config.json files in this repository have been updated since the initial release. If you downloaded MiMo-V2.5 before this commit (4da2748), please re-pull or manually update these two files to ensure correct model behavior. Using the outdated config may lead to degraded model performance. We apologize for any inconvenience.

Quick fix:

```bash
hf download XiaomiMiMo/MiMo-V2.5 config.json tokenizer_config.json --local-dir ./MiMo-V2.5
```
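If you prefer to do the same thing from Python, the two updated files can be re-fetched with `huggingface_hub`. A minimal sketch, assuming `./MiMo-V2.5` is the same local checkout targeted by the command above:

```python
from huggingface_hub import hf_hub_download

# Re-download only the two updated config files into the local checkout.
# force_download replaces any stale cached copies.
for filename in ("config.json", "tokenizer_config.json"):
    hf_hub_download(
        repo_id="XiaomiMiMo/MiMo-V2.5",
        filename=filename,
        local_dir="./MiMo-V2.5",
        force_download=True,
    )
```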

MiMo-V2.5

1. Introduction

MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture. Built upon the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders, it delivers robust performance across multimodal perception, long-context reasoning, and agentic workflows. Key features include:

  • Hybrid Attention Architecture: Inherits the hybrid design from MiMo-V2-Flash, interleaving Sliding Window Attention (SWA) and Global Attention (GA) at a 5:1 ratio with a 128-token sliding window. This reduces KV-cache storage by nearly 6× while maintaining long-context performance via a learnable attention sink bias (see the sketch after this list).

  • Native Omnimodal Encoders: Equipped with a 729M-param Vision Transformer (ViT) featuring hybrid window attention and a dedicated audio encoder initialized from the weights of MiMo-Audio, enabling high-quality image, video, and audio understanding.

  • Multi-Token Prediction (MTP): Three lightweight MTP modules with dense FFNs accelerate inference via speculative decoding and improve RL training efficiency.

  • Efficient Pre-Training: Trained on a total of ~48T tokens using FP8 mixed precision. The context window supports up to 1M tokens.

  • Agentic Capabilities: Post-training incorporates SFT, large-scale agentic RL, and Multi-Teacher On-Policy Distillation (MOPD), achieving strong performance on agentic tasks and multimodal understanding benchmarks.
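
To make the KV-cache claim concrete, here is a rough back-of-envelope sketch of how interleaving SWA and GA layers with a 128-token window shrinks the cache at long context. The layer counts and head dimensions come from the backbone table in Section 4; everything else (per-sequence accounting, ignoring dtype, paging, and MTP) is a simplifying assumption, so the result only approximates the model card's "nearly 6×" figure:

```python
# Rough KV-cache size estimate for the hybrid attention design (simplified accounting).
# From the Section 4 table: 39 SWA layers, 9 GA layers, 4 KV heads (SWA) / 8 KV heads (GA),
# K head dim 192, V head dim 128, sliding window 128.

def kv_cache_elems(context_len: int, swa_layers: int, ga_layers: int,
                   swa_kv_heads: int, ga_kv_heads: int,
                   k_dim: int, v_dim: int, window: int) -> int:
    """Total cached K/V elements per sequence (ignores dtype, paging, MTP)."""
    per_token = k_dim + v_dim
    swa = swa_layers * swa_kv_heads * min(context_len, window) * per_token
    ga = ga_layers * ga_kv_heads * context_len * per_token
    return swa + ga

ctx = 1_000_000
hybrid = kv_cache_elems(ctx, swa_layers=39, ga_layers=9,
                        swa_kv_heads=4, ga_kv_heads=8, k_dim=192, v_dim=128, window=128)
# Baseline: all 48 layers use global attention with 8 KV heads.
full = kv_cache_elems(ctx, swa_layers=0, ga_layers=48,
                      swa_kv_heads=4, ga_kv_heads=8, k_dim=192, v_dim=128, window=128)
print(f"hybrid/full cache ratio: {hybrid / full:.3f}")
# ≈ 0.19, i.e. roughly 5x smaller under these assumptions
# (the model card's "nearly 6x" presumably uses its own accounting).
```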

MiMo-V2.5 Architecture

Model Summary

  • Architecture: Sparse MoE (Mixture of Experts), 310B total / 15B activated parameters
  • Context Length: Up to 1M tokens
  • Modalities: Text, Image, Video, Audio
  • Vision Encoder: 729M-param ViT (28 layers: 24 SWA + 4 Full)
  • Audio Encoder: 261M-param Audio Transformer (24 layers: 12 SWA + 12 Full)
  • Multi-Token Prediction (MTP): 329M parameters, 3 layers
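
As a sanity check on the 310B-total / 15B-activated split, a crude parameter estimate from the backbone table in Section 4 is sketched below. It assumes a SwiGLU-style three-matrix FFN per routed expert and ignores attention, embeddings, any shared experts, and the MTP modules, so the totals are only meant to land in the right ballpark:

```python
# Back-of-envelope MoE parameter count (simplifying assumptions, not official accounting).
hidden = 4096          # hidden size
moe_inter = 2048       # per-expert FFN intermediate size
moe_layers = 47        # 48 layers = 1 dense + 47 MoE
routed_experts = 256
experts_per_token = 8

# SwiGLU-style FFN: gate, up, and down projections -> 3 * hidden * intermediate params per expert.
params_per_expert = 3 * hidden * moe_inter

total_expert_params = moe_layers * routed_experts * params_per_expert
active_expert_params = moe_layers * experts_per_token * params_per_expert

print(f"routed-expert params, total : {total_expert_params / 1e9:.0f}B")   # ~303B of the 310B total
print(f"routed-expert params, active: {active_expert_params / 1e9:.1f}B")  # ~9.5B of the 15B activated
```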

2. Downloads

| Model | Context Length | Download |
|---|---|---|
| MiMo-V2.5-Base | 256K | 🤗 HuggingFace · 🤖 ModelScope |
| MiMo-V2.5 | 1M | 🤗 HuggingFace · 🤖 ModelScope |
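
For a full local copy of the weights (rather than the two-file fix above), the whole repository can be fetched with `huggingface_hub`. A minimal sketch, assuming the 1M-context chat model is wanted and that the base model lives under the analogous repo id:

```python
from huggingface_hub import snapshot_download

# Download the complete MiMo-V2.5 repository (configs, tokenizer, and weight shards).
# Presumably "XiaomiMiMo/MiMo-V2.5-Base" is the repo id for the 256K base model.
local_path = snapshot_download(
    repo_id="XiaomiMiMo/MiMo-V2.5",
    local_dir="./MiMo-V2.5",
)
print(f"model files available under {local_path}")
```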

3. Evaluation Results

Multimodal Benchmarks

MiMo-V2.5 Multimodal Benchmark Results

Coding & Agent Benchmarks

MiMo-V2.5 Coding and Agentic Benchmark Results

Long Context Benchmarks

MiMo-V2.5 Graphwalks

4. Model Architecture

LLM Backbone

MiMo-V2.5's core language backbone inherits from the MiMo-V2-Flash architecture, a sparse MoE model with hybrid sliding window attention.

| Component | MiMo-V2.5-Pro | MiMo-V2.5 |
|---|---|---|
| Total Parameters | 1.02T | 310B |
| Activated Parameters | 42B | 15B |
| Hidden Size | 6144 | 4096 |
| Num Layers | 70 (1 dense + 69 MoE) | 48 (1 dense + 47 MoE) |
| Full Attention Layers | 10 | 9 |
| SWA Layers | 60 | 39 |
| Num Attention Heads | 128 | 64 |
| Num KV Heads | 8 (GQA) | 8 (GA) / 4 (SWA) |
| Head Dim (QK / V) | 192 / 128 | 192 / 128 |
| Routed Experts | 384 | 256 |
| Experts per Token | 8 | 8 |
| MoE Intermediate Size | 2048 | 2048 |
| Dense Intermediate Size | 16384 (layer 0 only) | 16384 (layer 0 only) |
| SWA Window Size | 128 | 128 |
| Max Context Length | 1M | 1M |
| MTP Layers | 3 | 3 |

Vision Encoder

We train a dedicated MiMo ViT that adopts sliding-window attention to enable efficient visual encoding.

| Configuration | Value |
|---|---|
| Total Layers | 28 |
| SWA Layers | 24 |
| Full Attention Layers | 4 |
| Window-Attention Pattern | [-1] + [0,0,0,0,1,1,1,1,-1] × 3 |
| Attention Heads (Q / KV) | 32 / 8 |
| Head Dimensions (QK / V) | 64 / 64 |
| Sliding Window Size (L / R) | 64 / 64 |

Window pattern notation: -1 = full attention, 0 = 1-D row window, 1 = 1-D column window.
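
A small sketch of how that pattern expands into the 28-layer schedule (purely illustrative; the actual config format may differ):

```python
# Expand the ViT window-attention pattern into one entry per layer.
# -1 = full attention, 0 = 1-D row window, 1 = 1-D column window.
pattern = [-1] + [0, 0, 0, 0, 1, 1, 1, 1, -1] * 3

assert len(pattern) == 28                      # Total Layers
assert pattern.count(-1) == 4                  # Full Attention Layers
assert len(pattern) - pattern.count(-1) == 24  # SWA Layers (row + column windows)
```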

Audio Encoder

Our audio encoder is initialized from the weights of MiMo-Audio-Tokenizer and further finetuned to support high-quality audio understanding.

| Configuration | Value |
|---|---|
| Total Layers | 24 |
| SWA Layers | 12 |
| Full Attention Layers | 12 |
| Sliding Window Size | 128 |
| Attention Heads (Q / KV) | 16 / 16 |
| Head Dimensions (QK / V) | 64 / 64 |

5. Training Process

MiMo-V2.5 is trained on a total of ~48T tokens.

  1. Text Pre-training: We collect diverse text data for pre-training the LLM backbone.
  2. Projector Warmup: A short warmup stage for the multimodal projectors (the audio and visual MLP projectors).
  3. Multimodal Pre-training: Large-scale pre-training on high-quality multimodal data.
  4. SFT & Agentic Post-Training: Supervised fine-tuning with diverse agentic data. During this stage, the context window is progressively extended from 32K → 256K → 1M.
  5. RL & MOPD Training: Reinforcement learning and Multi-Teacher On-Policy Distillation (MOPD) to improve perception, reasoning, and agentic capabilities.