MiMo-V2.5 Model Documentation and Integration Guide

Published on 2026.07.01 by DeepInfra

MiMo-V2.5 is a native omnimodal model developed by XiaomiMiMo, designed to process and understand text, image, video, and audio through a unified architecture rather than relying on “bolted-on” components for each modality.

Built on a 310-billion-parameter Sparse Mixture of Experts (MoE) architecture — with only 15 billion parameters activated during inference — MiMo-V2.5 offers a strong balance of high-tier reasoning and computational efficiency. With a 1-million-token context window and agentic capabilities, it is engineered for complex multimodal perception, long-context reasoning, and autonomous workflows.
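As a rough illustration of what sparse activation means in practice, the sketch below computes the share of parameters used per token from the two figures above. The one-byte-per-parameter storage figure is an assumption for illustration (it corresponds to 8-bit weights), not a published deployment requirement, and the function names are hypothetical.

```python
# Figures taken from the paragraph above.
TOTAL_PARAMS = 310e9   # total parameters in the MoE model
ACTIVE_PARAMS = 15e9   # parameters activated per token at inference


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that take part in one forward pass."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: float = 1.0) -> float:
    """Approximate weight storage in GB; bytes_per_param is an assumption."""
    return params * bytes_per_param / 1e9


print(f"active share: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.1%}")
print(f"weights at 1 byte/param: {weight_memory_gb(TOTAL_PARAMS):.0f} GB")
```

Roughly 4.8% of the parameters run for each token, which lowers compute per token. All experts still have to be stored, so weight memory scales with the full 310 billion parameters rather than the 15 billion active ones.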

Architectural Capabilities

MiMo-V2.5 builds on its predecessor, MiMo-V2-Flash. Rather than attaching separate components for each modality, it uses native, dedicated encoders for each input type, so text, image, video, and audio are handled within one unified architecture.

Key Technical Features

Native Omnimodal Encoders: Includes a 729-million-parameter Vision Transformer with hybrid window attention and a 261-million-parameter audio encoder.
Hybrid Attention Architecture: By interleaving Sliding Window Attention (SWA) and Global Attention (GA) in a 5:1 ratio, the model reduces KV-cache storage requirements by roughly 6× without sacrificing long-context integrity.
Multi-Token Prediction (MTP): Three lightweight MTP modules (329M parameters) accelerate inference through speculative decoding and improve the efficiency of reinforcement learning.
Advanced Training: Trained on approximately 48 trillion tokens using FP8 mixed precision, the model has undergone Supervised Fine-Tuning (SFT) and Multi-Teacher On-Policy Distillation (MOPD) to perform well on agentic tasks.
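The "roughly 6×" KV-cache figure in the hybrid attention item can be reproduced with simple arithmetic. The sketch below assumes every layer stores the same amount of KV data per token, that a sliding-window layer keeps only the most recent `window` tokens, and that a global layer keeps all of them. The window size of 128 is a hypothetical value chosen for illustration; the article does not state the real one.

```python
def kv_cache_reduction(seq_len: int, window: int, swa_per_global: int = 5) -> float:
    """Ratio of full-attention KV cache to hybrid KV cache for one layer group.

    A group is `swa_per_global` sliding-window layers plus one global layer.
    Sizes are counted in cached token positions per group.
    """
    layers_per_group = swa_per_global + 1
    # Baseline: every layer in the group caches the whole sequence.
    full_cache = layers_per_group * seq_len
    # Hybrid: one global layer caches everything, SWA layers cache the window.
    hybrid_cache = seq_len + swa_per_global * min(seq_len, window)
    return full_cache / hybrid_cache


for n in (1_000, 100_000, 1_000_000):
    print(n, round(kv_cache_reduction(n, window=128), 2))
```

Under these assumptions the ratio approaches 6 as the sequence grows, because the single global layer comes to dominate the cache. For sequences shorter than the window there is no saving at all.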

Configuration Notice: Developers who downloaded the model prior to recent repository updates should re-pull the config.json and tokenizer_config.json files to ensure optimal performance and avoid degraded behavior.

Performance and Benchmarks

MiMo-V2.5 demonstrates competitive performance against frontier closed-source models, particularly in coding, temporal video reasoning, and agentic decision-making.

Agentic and Coding Performance

The model’s use of Reinforcement Learning (RL) places it near the Pareto frontier for daily agentic tasks.

Benchmark	Category	MiMo-V2.5 Score	Claude Opus 4.6	Gemini 3.1 Pro
Coding (General)	Programming/Logic	71.8	77.1	67.8
Claw-Eval Text	General Agentic	65.8	70.8	68.5
Terminal-Bench 2.0	CLI Operations	56.1	57.3	54.2

Multimodal Perception

MiMo-V2.5 performs strongly on image and video understanding, including temporal reasoning. On the benchmarks below where a comparison is available, it scores within roughly three points of Gemini 3 Pro.

Benchmark	Modality	MiMo-V2.5 Score	Gemini 3 Pro	Kimi K2.6
Image Understanding	Vision-Language	81.0	81.4	80.4
Video-MME	Video	83.5	84.2	—
MMMU-Pro	Multi-discipline	88.5	—	—
CharXiv RQ	Chart/Diagram	77.9	81.0	79.4

Long-Context Integrity

The model supports up to 1,000,000 tokens, validated through benchmarks like Graphwalks for path-finding and retrieval. A learnable attention sink bias helps reasoning accuracy remain stable even at the 1M token limit.
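The article does not spell out how the attention sink bias is defined. One common formulation adds a learned scalar to the softmax denominator, so a head can give the visible tokens less than full weight when none of them is relevant. The sketch below illustrates that formulation only; it is an assumption, not a description of MiMo-V2.5's implementation, and `softmax_with_sink` is a hypothetical name.

```python
import math


def softmax_with_sink(scores: list[float], sink_bias: float) -> list[float]:
    """Softmax over attention scores with an extra sink term in the denominator.

    The sink absorbs probability mass but contributes no value vector,
    so the returned weights sum to less than 1 when the sink is active.
    """
    # Subtract the maximum for numerical stability.
    m = max(max(scores), sink_bias)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_bias - m)
    return [e / denom for e in exps]


weights = softmax_with_sink([0.0, 0.0], sink_bias=0.0)
print(weights, sum(weights))
```

With a very negative sink bias the result reduces to an ordinary softmax. With a large sink bias the head contributes almost nothing.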

Getting Started with the API

MiMo-V2.5 is hosted on DeepInfra, providing high-performance, low-latency inference via an OpenAI-compatible API.

Authentication

Retrieve your API key from your DeepInfra Dashboard and include it in your HTTP headers:

Authorization: Bearer <YOUR_DEEPINFRA_API_KEY>

API Basics

Base URL: https://api.deepinfra.com/v1/openai
Endpoint: POST /chat/completions
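To show how the base URL, endpoint, and Authorization header fit together, here is a minimal sketch that builds the request using only the Python standard library, without sending it. The helper name `build_chat_request` is hypothetical. The model ID and the `DEEPINFRA_API_KEY` environment variable match those used elsewhere in this guide.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.deepinfra.com/v1/openai"


def build_chat_request(messages, model="XiaomiMiMo/MiMo-V2.5", api_key=None):
    """Return a prepared POST request for the chat completions endpoint."""
    api_key = api_key or os.environ.get("DEEPINFRA_API_KEY")
    if not api_key:
        raise RuntimeError("Set DEEPINFRA_API_KEY or pass api_key explicitly")
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request([{"role": "user", "content": "Hello"}], api_key="example-key")
print(req.get_method(), req.full_url)
```

Passing the prepared request to `urllib.request.urlopen(req)` would send it. The example stops short of that so it can be run without a valid key.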

Implementation Examples

Using cURL

curl -X POST https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Authorization: Bearer $DEEPINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [
      {
        "role": "user",
        "content": "Explain the advantages of a hybrid attention architecture in 2 sentences."
      }
    ]
  }'

Using Python

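import os
import requests

# Chat completions endpoint: base URL plus /chat/completions
url = "https://api.deepinfra.com/v1/openai/chat/completions"
# Read the key from the environment; this raises KeyError if it is unset
api_key = os.environ["DEEPINFRA_API_KEY"]

payload = {
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [{"role": "user", "content": "Explain the advantages of a hybrid attention architecture."}],
}

response = requests.post(url, headers={"Authorization": f"Bearer {api_key}"}, json=payload, timeout=60)
response.raise_for_status()  # raise on HTTP errors instead of printing an error body
print(response.json())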
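Printing the raw JSON is rarely what an application wants. Since the endpoint is described as OpenAI-compatible, the sketch below assumes the response follows the OpenAI chat completions schema (`choices[0].message.content` plus a `usage` object). The `sample` dictionary is invented for illustration and is not real model output, and `extract_reply` is a hypothetical helper.

```python
from typing import Any


def extract_reply(data: dict[str, Any]) -> tuple[str, dict[str, int]]:
    """Pull the assistant text and token usage out of a chat completion."""
    choices = data.get("choices") or []
    if not choices:
        raise ValueError(f"no choices in response: {data!r}")
    text = choices[0]["message"]["content"]
    usage = data.get("usage", {})
    return text, usage


# Invented response in the assumed OpenAI-compatible shape.
sample = {
    "choices": [{"message": {"role": "assistant", "content": "Example reply."}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}

text, usage = extract_reply(sample)
print(text)
print(usage.get("prompt_tokens"), usage.get("completion_tokens"))
```

The token counts in `usage`, when present, are the numbers that usage-based billing is calculated from.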

Pricing and Service Tiers

Pricing is usage-based, calculated per 1 million tokens. DeepInfra offers two tiers to balance cost and priority.

Pricing Table (Per 1M Tokens)

Tier	Input Price	Output Price	Cached Input Price
Standard	$0.40	$2.00	$0.08
Priority (1.5×)	$0.60	$3.00	$0.12

Key Pricing Considerations

Cached Input Discount: Input tokens served from the cache are billed at $0.08 per 1M tokens on Standard, 80% below the regular input price of $0.40, which makes long-context conversations more cost-effective.
Priority Tier: Users requiring lower latency and prioritized processing can opt for the Priority Tier, which applies a 1.5× multiplier to all costs.
Free Tier: Refer to the DeepInfra Pricing Page for current free-tier availability and limitations.
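The per-request cost can be estimated directly from the pricing table. The sketch below assumes cached tokens are a subset of the input tokens and are billed at the cached rate instead of the regular input rate. Actual billing may differ, so treat it as an estimate. `estimate_cost` is a hypothetical helper.

```python
# Prices in USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "standard": {"input": 0.40, "cached_input": 0.08, "output": 2.00},
    "priority": {"input": 0.60, "cached_input": 0.12, "output": 3.00},
}


def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0, tier: str = "standard") -> float:
    """Estimated USD cost of one request under the stated assumptions."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    price = PRICES[tier]
    uncached_tokens = input_tokens - cached_tokens
    total = (
        uncached_tokens * price["input"]
        + cached_tokens * price["cached_input"]
        + output_tokens * price["output"]
    )
    return total / 1_000_000


# 200k-token prompt with 150k tokens served from cache, 2k-token reply.
print(f"${estimate_cost(200_000, 2_000, cached_tokens=150_000):.4f}")
print(f"${estimate_cost(200_000, 2_000, cached_tokens=150_000, tier='priority'):.4f}")
```

In this example the Standard request costs about $0.036 and the Priority request about $0.054, consistent with the 1.5× multiplier.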

Conclusion

XiaomiMiMo’s MiMo-V2.5 is a capable and versatile model. By combining a 1M-token context window with native omnimodal understanding and an efficient MoE architecture, it gives developers frontier-model capabilities at a lower resource cost.

Whether you are building agentic workflows, analyzing hour-long videos, or processing large document sets, MiMo-V2.5 offers the performance and flexibility for professional-grade deployment.
