MiMo-V2.5 Is Now Available on DeepInfra

Published on 2026.07.01 by DeepInfra

Xiaomi’s MiMo-V2.5 collapses what used to require two separate models — frontier agentic capability and native multimodal understanding — into one. Previously, MiMo-V2-Pro handled agentic and coding tasks while MiMo-V2-Omni covered visual and audio inputs; MiMo-V2.5 replaces both. It handles text, images, video, and audio natively, extends context to 1 million tokens, and scores 71.8 on SWE-Bench Pro, surpassing its agentic-specialized predecessor.

The efficiency story is just as notable as the capability one. Despite 310B total parameters, only 15B are active per forward pass thanks to a sparse Mixture-of-Experts architecture, which keeps the compute cost per token closer to that of a much smaller dense model. The 1M-token context window is backed by a hybrid sliding-window and global attention mechanism that reduces KV-cache storage by roughly 6×, easing the memory pressure that usually limits long-context throughput. Xiaomi released the weights under the MIT license, which permits commercial use. And now, it’s available on DeepInfra.

What Makes This Model Different

MiMo-V2.5 is Xiaomi’s first model to unify agentic and multimodal capabilities under a single architecture. Beyond merging the roles of MiMo-V2-Pro (agentic tasks) and MiMo-V2-Omni (multimodal understanding), it adds native video and audio input and extends the context window to 1 million tokens, all while improving on V2-Pro’s agentic benchmark scores.

Architecture

The model is a sparse MoE with 310B total parameters, of which only 15B are active per forward pass (256 routed experts, 8 active per token). The language backbone uses a hybrid attention design that interleaves Sliding Window Attention (SWA) and Global Attention (GA) layers at a 5:1 ratio with a 128-token window. This reduces KV-cache storage by roughly 6× compared to full attention, a saving that matters more as context approaches 1M tokens. A 329M-parameter Multi-Token Prediction (MTP) module supports speculative decoding and also improves RL training efficiency.
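To see where the "roughly 6×" figure can come from, here is a back-of-the-envelope sketch. It assumes that the 5:1 ratio means five SWA layers for every GA layer, that every layer stores the same amount of K/V state per cached token, and that an SWA layer only keeps the last 128 tokens. These are simplifying assumptions for illustration, not details confirmed by Xiaomi, and the function names are hypothetical.

```python
def kv_tokens_hybrid(context_len: int, swa_layers: int = 5,
                     ga_layers: int = 1, window: int = 128) -> int:
    """Cached token slots for one repeating group of SWA + GA layers."""
    swa = swa_layers * min(context_len, window)  # SWA keeps only the window
    ga = ga_layers * context_len                 # GA keeps the full context
    return swa + ga


def kv_reduction(context_len: int, swa_layers: int = 5,
                 ga_layers: int = 1, window: int = 128) -> float:
    """How many times smaller the hybrid cache is than full attention."""
    full = (swa_layers + ga_layers) * context_len
    return full / kv_tokens_hybrid(context_len, swa_layers, ga_layers, window)


# Share of parameters active per forward pass, from the figures in the post.
active_fraction = 15 / 310

if __name__ == "__main__":
    for n in (128, 1_000, 32_000, 262_144, 1_000_000):
        print(f"{n:>9,} tokens: {kv_reduction(n):.2f}x smaller KV cache")
    print(f"active parameters: {active_fraction:.1%} of total")
```

Under these assumptions the saving is negligible for short prompts and approaches the layer ratio (6×) only as the context grows far beyond the window, which is why the design pays off mainly at long context.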

The vision and audio encoders were both pretrained in-house by Xiaomi:

Vision encoder: 729M-param ViT, 28 layers (24 SWA + 4 full attention)
Audio encoder: 261M-param Audio Transformer, 24 layers (12 SWA + 12 full attention), initialized from MiMo-Audio-Tokenizer weights
Both connect to the LLM backbone via lightweight MLP projectors

Training

The model was trained on ~48T tokens using FP8 mixed precision across five stages: text pre-training, projector warmup, multimodal pre-training, SFT with agentic post-training (with progressive context extension: 32K → 256K → 1M), and a final RL stage using Multi-Teacher On-Policy Distillation (MOPD). The MOPD step distills from multiple teacher models simultaneously during online RL rollouts, targeting perception, reasoning, and agentic capabilities together.

Performance

On agentic and coding benchmarks, MiMo-V2.5 edges past its predecessor MiMo-V2-Pro on all three tests below and leads Gemini 3.1 Pro, while still trailing Claude Opus 4.6:

Benchmark	MiMo-V2.5	MiMo-V2-Pro	Claude Opus 4.6	Gemini 3.1 Pro
SWE-Bench Pro	71.8	71.5	77.1	67.8
MiMo Coding Bench	62.3	57.8	70.8	57.8
Terminal-Bench 2.0	56.1	55.0	57.3	54.2

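On multimodal benchmarks, it beats MiMo-V2-Omni on every test listed, leads Gemini 3 Pro on MMMU-Pro, and stays within a point of it on the other three. The most notable gaps over V2-Omni show up in video understanding: VideoHolmes rises from 59.5 to 64.0, and Claw-Eval Multimodal (not shown in the table) from 15.8 to 23.8: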

Benchmark	MiMo-V2.5	MiMo-V2-Omni	Gemini 3 Pro
MMMU-Pro	88.5	83.3	86.4
Video-MME	87.7	85.3	88.4
DailyOmni	83.5	80.5	84.2
VideoHolmes	64.0	59.5	64.2

The model is fully open-sourced under the MIT license, with weights available on Hugging Face in two variants: a base model (256K context) and the full instruct model (1M context). If you want to explore other models in the same family, the multimodal models page has the full listing.

Getting Started on DeepInfra

MiMo-V2.5 is available as a public endpoint on DeepInfra, with private endpoint deployment also supported. Pricing is straightforward: the Standard tier costs $0.40 per 1M input tokens and $2.00 per 1M output tokens, with cached input at $0.08 per 1M tokens. The Priority tier, which reduces queuing time under load, costs $0.60 per 1M input tokens and $3.00 per 1M output tokens ($0.12 per 1M cached). The platform-listed context window is 262,144 tokens, which is below the model's native 1M-token limit. The model supports JSON mode, function calling, and multimodal inputs (text, image, video, and audio), and works in English and Chinese.
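To make the pricing concrete, here is a small cost estimator built from the rates quoted above. It is a sketch, not DeepInfra's billing logic: it assumes cached input tokens are billed at the cached rate in place of the normal input rate, and the `Tier` and `estimate_cost` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """USD per 1M tokens, as listed in this post."""
    input_per_m: float
    cached_per_m: float
    output_per_m: float


STANDARD = Tier(input_per_m=0.40, cached_per_m=0.08, output_per_m=2.00)
PRIORITY = Tier(input_per_m=0.60, cached_per_m=0.12, output_per_m=3.00)


def estimate_cost(tier: Tier, input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0) -> float:
    """Estimated USD cost of one request.

    Assumption: cached tokens are a subset of the input tokens and are
    charged at the cached rate instead of the regular input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    total = (uncached * tier.input_per_m
             + cached_tokens * tier.cached_per_m
             + output_tokens * tier.output_per_m)
    return total / 1_000_000


if __name__ == "__main__":
    # Example: 100K-token prompt (20K of it cached) and a 5K-token answer.
    for name, tier in (("standard", STANDARD), ("priority", PRIORITY)):
        cost = estimate_cost(tier, 100_000, 5_000, cached_tokens=20_000)
        print(f"{name}: ${cost:.4f}")
```

Because every Priority rate is 1.5 times its Standard counterpart, the Priority estimate for any request is 1.5 times the Standard one.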

DeepInfra exposes an OpenAI-compatible API with usage-based billing: swap in the base URL, your token, and the model name, and it works with an existing OpenAI client. The platform operates under a zero-data-retention policy and is SOC 2 and ISO 27001 certified.

The MiMo-V2.5 API reference covers the full parameter set, but here’s everything you need for a first call:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "XiaomiMiMo/MiMo-V2.5",
      "messages": [
        {
          "role": "user",
          "content": "Walk me through how you would approach debugging a memory leak in a long-running Node.js service."
        }
      ]
    }'

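import os

from openai import OpenAI

# Read the token from the environment; "$DEEPINFRA_TOKEN" is shell syntax
# and would not be expanded inside a Python string.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)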


response = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2.5",
    messages=[
        {
            "role": "user",
            "content": "Walk me through how you would approach debugging a memory leak in a long-running Node.js service."
        }
    ],
)
print(response.choices[0].message.content)

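import OpenAI from "openai";

// Read the token from the environment; "$DEEPINFRA_TOKEN" is shell syntax
// and would not be expanded inside a JavaScript string.
const openai = new OpenAI({
  apiKey: process.env.DEEPINFRA_TOKEN,
  baseURL: "https://api.deepinfra.com/v1/openai",
});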


const response = await openai.chat.completions.create({
  model: "XiaomiMiMo/MiMo-V2.5",
  messages: [
    {
      role: "user",
      content: "Walk me through how you would approach debugging a memory leak in a long-running Node.js service.",
    },
  ],
});
console.log(response.choices[0].message.content);
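Since the endpoint is described as supporting multimodal input and function calling, a request combining the two might look like the sketch below. It uses the standard OpenAI chat-completions shapes for image content parts and tool definitions. Whether this endpoint accepts images in exactly this form is an assumption to verify against the API reference. The image URL and the `get_build_status` tool are hypothetical placeholders.

```python
import json

MODEL = "XiaomiMiMo/MiMo-V2.5"


def build_request(question: str, image_url: str) -> dict:
    """Build chat-completions keyword arguments with one image and one tool."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                # OpenAI-style content parts: text plus an image reference.
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        # Hypothetical tool, declared in the OpenAI function-calling schema.
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_build_status",
                    "description": "Look up the status of a CI build by ID.",
                    "parameters": {
                        "type": "object",
                        "properties": {"build_id": {"type": "string"}},
                        "required": ["build_id"],
                    },
                },
            }
        ],
    }


request = build_request(
    "This screenshot shows a failing CI run. Which build failed, and why?",
    "https://example.com/ci-screenshot.png",  # placeholder URL
)

if __name__ == "__main__":
    print(json.dumps(request, indent=2))
```

With the `client` from the Python example above, `client.chat.completions.create(**request)` would send it. If the model decides to call the tool, the call should appear under `response.choices[0].message.tool_calls` rather than in `message.content`.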

MiMo-V2.5-Pro and TTS Variants

Beyond the omnimodal MiMo-V2.5 model itself, the MiMo-V2.5 family includes a few other variants worth knowing about.

MiMo-V2.5-Pro is a dedicated MoE language model with 1.02T total parameters and 42B active parameters — larger and more capable on pure text and reasoning tasks than the base V2.5 model, and a reasonable choice when you don’t need multimodal inputs but want maximum headroom on agentic workloads. The MiMo-V2.5-Pro API reference has the full parameter documentation.

On the speech side, the family includes two TTS models. MiMo-V2.5-tts converts text to natural speech with configurable output parameters, and MiMo-V2.5-tts-voiceclone extends that with voice cloning — useful if you need consistent speaker identity across generated audio. Both are available as endpoints on DeepInfra, and you can test them directly from the MiMo-V2.5-tts-voiceclone demo page before wiring them into a pipeline.

Wrapping Up

MiMo-V2.5 is a concrete step toward consolidating the fragmented model stack that agentic development has required until now — one model handling code, reasoning, vision, audio, and long context, with an architecture efficient enough to make it practical at scale. The MIT license removes friction around commercial deployment, and the benchmark numbers suggest the capability consolidation hasn’t come at the cost of quality on either the agentic or multimodal side.

For teams building agents that need to perceive, reason, and act across diverse input types — automated code review pipelines, multi-step document analysis, tools that ingest screen recordings alongside text — this is a more coherent architecture than stitching together purpose-built models. Head to the MiMo-V2.5 model page to get your API key and start building.
