MiMo-V2.5 Model Documentation and Integration Guide
Published on 2026.07.01 by DeepInfra

MiMo-V2.5 is a native omnimodal model developed by XiaomiMiMo, designed to process and understand text, image, video, and audio through a unified architecture rather than relying on “bolted-on” components for each modality.

MiMo-V2.5 is built on a Sparse Mixture of Experts (MoE) architecture with 310 billion total parameters, of which only 15 billion are activated during inference, so it pairs high-tier reasoning with a much smaller compute footprint than its total size suggests. With a 1-million-token context window and agentic capabilities, it is engineered for complex multimodal perception, long-context reasoning, and autonomous workflows.
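To put the sparsity in perspective, the short calculation below works out what share of the parameters is active during inference. It uses only the two figures quoted above; the function name is illustrative.

```python
# Parameter counts quoted in the text above.
TOTAL_PARAMS = 310e9   # total MoE parameters
ACTIVE_PARAMS = 15e9   # parameters activated during inference


def active_fraction(active: float, total: float) -> float:
    """Return the share of parameters that is active during inference."""
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active share: {fraction:.1%}")  # prints "Active share: 4.8%"
```

Roughly one parameter in twenty takes part in a given forward pass, which is where the efficiency claim comes from.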

Architectural Capabilities

MiMo-V2.5 represents a meaningful step forward from its predecessor, MiMo-V2-Flash. By using native, dedicated encoders for each data type, the model integrates modalities more tightly than designs that attach a separate component per modality.

Key Technical Features

  • Native Omnimodal Encoders: Includes a 729-million-parameter Vision Transformer with hybrid window attention and a 261-million-parameter audio encoder.
  • Hybrid Attention Architecture: By interleaving Sliding Window Attention (SWA) and Global Attention (GA) in a 5:1 ratio, the model reduces KV-cache storage requirements by roughly 6× without sacrificing long-context integrity.
  • Multi-Token Prediction (MTP): Three lightweight MTP modules (329M parameters) accelerate inference through speculative decoding and improve the efficiency of reinforcement learning.
  • Advanced Training: Trained on approximately 48 trillion tokens using FP8 mixed precision, the model has undergone Supervised Fine-Tuning (SFT) and Multi-Teacher On-Policy Distillation (MOPD) to perform well on agentic tasks.
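The "roughly 6×" KV-cache figure for the hybrid attention layout can be checked with a back-of-the-envelope calculation. The sketch below counts cached token positions for one group of five SWA layers plus one GA layer and compares it with six full-attention layers. The window size of 128 is a hypothetical value chosen for illustration (the text above does not state it), and the sketch assumes every layer stores the same amount of KV data per token.

```python
def kv_cache_positions(context_len: int, window: int, swa_per_ga: int = 5) -> tuple[int, int]:
    """Count cached token positions for one group of (swa_per_ga + 1) layers.

    Returns (hybrid, all_global):
      hybrid     - swa_per_ga sliding-window layers plus one global layer
      all_global - the same number of layers, all using global attention
    """
    layers = swa_per_ga + 1
    # A sliding-window layer only needs the most recent `window` positions.
    swa_positions = swa_per_ga * min(window, context_len)
    # A global layer keeps every position in the context.
    ga_positions = context_len
    return swa_positions + ga_positions, layers * context_len


def kv_cache_reduction(context_len: int, window: int, swa_per_ga: int = 5) -> float:
    """Ratio of all-global cache size to hybrid cache size."""
    hybrid, all_global = kv_cache_positions(context_len, window, swa_per_ga)
    return all_global / hybrid


# window=128 is an assumed value, not a documented MiMo-V2.5 setting.
for n in (1_000, 100_000, 1_000_000):
    print(f"context={n:>9,}  reduction={kv_cache_reduction(n, window=128):.2f}x")
```

The ratio approaches 6 as the context grows far beyond the window, because the single global layer dominates the hybrid cache. For short contexts the saving is smaller.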

Configuration Notice: Developers who downloaded the model before the recent repository updates should re-pull config.json and tokenizer_config.json to avoid degraded behavior.
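One way to refresh just those two files is sketched below. It assumes the weights were pulled from the Hugging Face Hub and that the repository id matches the model name used later in this guide (XiaomiMiMo/MiMo-V2.5); neither is confirmed by the notice itself. The helper name refresh_config_files is hypothetical, while hf_hub_download and its force_download flag come from the huggingface_hub package.

```python
from typing import Callable, Optional, Sequence

CONFIG_FILES = ("config.json", "tokenizer_config.json")


def refresh_config_files(
    repo_id: str,
    download: Optional[Callable[..., str]] = None,
    filenames: Sequence[str] = CONFIG_FILES,
) -> list[str]:
    """Re-download the given files, bypassing any locally cached copies.

    `download` defaults to huggingface_hub.hf_hub_download; it is injectable
    so the helper can be exercised without network access.
    """
    if download is None:
        # Imported lazily so the module loads even without huggingface_hub installed.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    # force_download=True makes the Hub client fetch a fresh copy of each file.
    return [
        download(repo_id=repo_id, filename=name, force_download=True)
        for name in filenames
    ]


# Example call (assumed repository id, requires network access):
# paths = refresh_config_files("XiaomiMiMo/MiMo-V2.5")
```

The function returns the local paths of the refreshed files, which can then be copied over an existing checkpoint directory if the model is stored outside the Hub cache.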

Performance and Benchmarks

MiMo-V2.5 demonstrates competitive performance against frontier closed-source models, particularly in coding, temporal video reasoning, and agentic decision-making.

Agentic and Coding Performance

The model’s use of Reinforcement Learning (RL) places it near the Pareto frontier for daily agentic tasks.

| Benchmark | Category | MiMo-V2.5 Score | Claude Opus 4.6 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| Coding (General) | Programming/Logic | 71.8 | 77.1 | 67.8 |
| Claw-Eval Text | General Agentic | 65.8 | 70.8 | 68.5 |
| Terminal-Bench 2.0 | CLI Operations | 56.1 | 57.3 | 54.2 |

Multimodal Perception

MiMo-V2.5 performs well on perception and temporal reasoning, matching or approaching industry leaders in video and image understanding.

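| Benchmark | Modality | MiMo-V2.5 Score | Gemini 3 Pro | Kimi K2.6 |
| --- | --- | --- | --- | --- |
| Image Understanding | Vision-Language | 81.0 | 81.4 | 80.4 |
| Video-MME | Video | 83.5 | 84.2 | |
| MMMU-Pro | Multi-discipline | 88.5 | | |
| CharXiv RQ | Chart/Diagram | 77.9 | 81.0 | 79.4 |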

Long-Context Integrity

The model supports up to 1,000,000 tokens, validated through benchmarks like Graphwalks for path-finding and retrieval. A learnable attention sink bias helps reasoning accuracy remain stable even at the 1M token limit.
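For readers unfamiliar with attention sinks, the sketch below shows one common formulation: a learnable logit is added to the softmax denominator, so a query can place part of its attention mass on "nothing" instead of being forced to spread it over the visible tokens. The text above does not spell out the exact form MiMo-V2.5 uses, so treat this as an illustration of the general idea rather than the model's implementation. The function name and the example values are hypothetical.

```python
import math


def softmax_with_sink(scores: list[float], sink_logit: float) -> list[float]:
    """Softmax over attention scores with an extra sink term in the denominator.

    The sink absorbs probability mass but contributes no value vector, so the
    returned weights sum to less than 1 whenever the sink term is non-zero.
    """
    # Subtract the maximum for numerical stability.
    peak = max(max(scores), sink_logit)
    exps = [math.exp(s - peak) for s in scores]
    sink = math.exp(sink_logit - peak)
    denom = sum(exps) + sink
    return [e / denom for e in exps]


# Two equally scored tokens and a sink with the same logit:
# each token receives one third of the mass, the sink absorbs the rest.
weights = softmax_with_sink([0.0, 0.0], sink_logit=0.0)
print(weights, "sum =", sum(weights))
```

In a real model the sink logit would be a trained parameter, typically one per attention head; here it is passed in as a plain number.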

Getting Started with the API

MiMo-V2.5 is hosted on DeepInfra, providing high-performance, low-latency inference via an OpenAI-compatible API.

Authentication

Retrieve your API key from your DeepInfra Dashboard and include it in your HTTP headers:

Authorization: Bearer <YOUR_DEEPINFRA_API_KEY>

API Basics

  • Base URL: https://api.deepinfra.com/v1/openai
  • Endpoint: POST /chat/completions

Implementation Examples

Using cURL

curl -X POST https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Authorization: Bearer $DEEPINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [
      {
        "role": "user",
        "content": "Explain the advantages of a hybrid attention architecture in 2 sentences."
      }
    ]
  }'

Using Python

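import os
import requests

# Chat completions endpoint: base URL plus /chat/completions.
url = "https://api.deepinfra.com/v1/openai/chat/completions"
api_key = os.getenv("DEEPINFRA_API_KEY")  # set this environment variable first

payload = {
    "model": "XiaomiMiMo/MiMo-V2.5",
    "messages": [{"role": "user", "content": "Explain the advantages of a hybrid attention architecture."}],
}

response = requests.post(
    url,
    headers={"Authorization": f"Bearer {api_key}"},
    json=payload,
    timeout=60,  # avoid hanging indefinitely on a stalled connection
)
response.raise_for_status()  # raise on HTTP errors instead of printing an error body
print(response.json())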
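The examples above print the raw JSON. Because the API is described as OpenAI-compatible, the reply text and token counts should sit under the standard chat-completions fields (choices[0].message.content and usage). The sketch below pulls them out; the helper name extract_reply and the sample payload are illustrative, and the exact fields DeepInfra returns should be checked against a real response.

```python
from typing import Any


def extract_reply(body: dict[str, Any]) -> tuple[str, dict[str, int]]:
    """Return (assistant_text, usage) from an OpenAI-style chat completion body."""
    choices = body.get("choices") or []
    if not choices:
        # Error responses usually carry no choices; fail loudly with the body.
        raise ValueError(f"No choices in response: {body}")
    text = choices[0]["message"]["content"]
    usage = body.get("usage", {})
    return text, usage


# Hand-written sample in the OpenAI chat-completions shape (not real model output).
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Hello!"}, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}

text, usage = extract_reply(sample)
print(text)
print(usage.get("prompt_tokens"), usage.get("completion_tokens"))
```

With the requests example, the same helper would be called as extract_reply(response.json()).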

Pricing and Service Tiers

Pricing is usage-based, calculated per 1 million tokens. DeepInfra offers two tiers to balance cost and priority.

Pricing Table (Per 1M Tokens)

| Tier | Input Price | Output Price | Cached Input Price |
| --- | --- | --- | --- |
| Standard | $0.40 | $2.00 | $0.08 |
| Priority (1.5×) | $0.60 | $3.00 | $0.12 |

Key Pricing Considerations

  • Cached Input Discount: Tokens successfully retrieved from the cache are billed at $0.08 per 1M tokens on Standard, 80% below the regular $0.40 input price, making long-context conversations more cost-effective.
  • Priority Tier: Users requiring lower latency and prioritized processing can opt for the Priority Tier, which applies a 1.5× multiplier to all costs.
  • Free Tier: Refer to the DeepInfra Pricing Page for current free-tier availability and limitations.
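The pricing rules above can be turned into a small cost estimator. This is a sketch under two assumptions that the pricing table does not state explicitly: cached tokens are counted as a subset of the input tokens, and they are billed at the cached rate instead of the regular input rate. The function and constant names are illustrative.

```python
# Standard-tier prices in USD per 1M tokens, taken from the table above.
PRICES_PER_MILLION = {"input": 0.40, "output": 2.00, "cached_input": 0.08}
PRIORITY_MULTIPLIER = 1.5


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    cached_tokens: int = 0,
    priority: bool = False,
) -> float:
    """Estimate the USD cost of one request.

    Assumes `cached_tokens` is the part of `input_tokens` served from cache
    and is billed at the cached rate instead of the input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    cost = (
        uncached_tokens * PRICES_PER_MILLION["input"]
        + cached_tokens * PRICES_PER_MILLION["cached_input"]
        + output_tokens * PRICES_PER_MILLION["output"]
    ) / 1_000_000
    # The Priority tier multiplies every price by 1.5.
    return cost * PRIORITY_MULTIPLIER if priority else cost


# A 200k-token prompt with 150k tokens served from cache and a 2k-token reply.
standard = estimate_cost(200_000, 2_000, cached_tokens=150_000)
priority = estimate_cost(200_000, 2_000, cached_tokens=150_000, priority=True)
print(f"Standard: ${standard:.4f}  Priority: ${priority:.4f}")
```

Actual invoices depend on how the provider counts cached tokens, so treat the result as an estimate.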

Conclusion

XiaomiMiMo’s MiMo-V2.5 is a capable and versatile model for the next generation of AI applications. By combining a 1M token context window with native omnimodal understanding and an efficient MoE architecture, it gives developers frontier-model capabilities at a comparatively lower resource cost.

Whether you are building agentic workflows, analyzing hour-long videos, or processing large document sets, MiMo-V2.5 offers the performance and flexibility for professional-grade deployment.
