We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

XiaomiMiMo/

MiMo-V2.5-tts

Partner

$0.00

/ 1M characters

Automatically convert input text into natural and fluent speech output. You can generate natural and vivid speech content by configuring parameters such as speech style and voice. Use the high-quality voices from the built-in voices list.

Public

api versions voice

Input

Input text

Text to convert to speech. May contain (style) and [audio tag] controls — e.g. "(Lazy)Let me sleep five more minutes..." or "(Sighing)I miss those days." See the MiMo-V2.5-TTS docs for the full tag list.

You need to login to use this model

Settings

ServiceTier

The service tier used for processing the request. When set to 'priority', the request will be processed with higher priority (only applies to models that support it).

Voice

Built-in voice name for MiMo-V2.5-TTS speech synthesis.

Style instruction

Optional natural language instruction describing the desired speaking style, emotion, role-play, dialect, etc. Sent in the `user` role; not synthesized.. (Default: empty)

TtsResponseFormat

Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.

Stream

Whether to stream the output. The MiMo-V2.5-TTS streaming API is currently downgraded to compatibility mode and returns the full audio after generation finishes.

Output

Waiting for audio data... Submit request to start streaming.

Model Information

Free for a limited time

MiMo-V2.5-TTS Series

Speech Synthesis (Text-to-Speech) supports automatically converting input text into natural and fluent speech output. You can generate natural and vivid speech content by configuring parameters such as speech style and voice.

Core Capabilities

Out-of-the-box built-in voices: A variety of high-quality built-in voices are available for quick use without additional configuration.
Voice design and cloning: Supports voice design via text description, or replication of arbitrary voices based on audio samples.
Diverse speech styles: Supports control over speed, emotion, role-play, dialects and other styles, for more vivid and natural speech expression.

List of Supported Models

Currently, three models of the MiMo-V2.5-TTS series are supported, and the model list is as follows:

Model Name	Function	Voice	Precautions
MiMo-V2.5-TTS	Use built-in high-quality voices for speech synthesis	Use the high-quality voices from the built-in voices list	Supports singing mode; does not support voice design and voice cloning
MiMo-V2.5-TTS-VoiceDesign	Customize voice through text description	Automatically generate voices from text descriptions, without requiring presets or audio samples	Does not support singing mode, built-in voices, or voice cloning
MiMo-V2.5-TTS-VoiceClone	Replicate any voice from audio samples	Precisely replicate voices from audio samples to enable speech synthesis of any voice	Does not support singing mode, built-in voices, or voice design

Style Control

The instruction-following ability of the model is sufficient to cover the following complex controls (a single natural language instruction is sufficient to take effect):

Multi-style Switching: A single character completes the style transition from announcement → whisper → roar within the same voice segment, with a natural and unobtrusive transition.
Multi-emotion Mixing: Supports complex emotions such as "repressed anger", "smile with a sob", "gentle but tired", "gentleness in mania", etc., rather than only allowing the selection of a single emotion.
Multi-granularity control: From paragraph level (overall tone) → sentence level (rhythm) → word level (stress) → character granularity (choking, dragging, or breathy sound of a specific character), all can be specified in the instruction.

We currently offer two control methods: natural language control and tag control. The placement of the content for both methods in messages is different:

Natural Language Control → Placed in role: user's content
Audio Tag Control → Placed in role: assistant's content

Natural Language Control

Through natural language description, enable the model to understand and generate speech in the corresponding style. The content is placed in the messages field of role: user in the content field. You can directly describe the desired speech style in a single sentence.

Example:

Report good news to the leader in a brisk and upbeat tone, speaking at a slightly faster pace, with the uncontrollable excitement and a touch of pride after learning the results, and a bright and energetic voice.

Looking at the results of the just-solved difficult problem, couldn't help exclaiming in a self-satisfied and overjoyed manner, with a high-pitched and bright voice, a relatively fast speaking speed, and a tone full of confidence and disbelief.

With a bright and lively teenage voice, carrying the pride and playfulness after a successful prank, speaking at a relatively fast pace with light enunciation, and the tone slightly rising when emphasizing the bet.

On this basis, we also support a more complex and refined director mode — just like writing a script for actors, comprehensively depicting characters and voices from the three dimensions of character, scene, and guidance, based on which the model can generate more layered and performative voices.

[Character] Clearly describe the character's identity, personality traits, physical appearance and speaking habits.
[Scene] Describe what is happening at this moment, who you are talking to, and what emotional state you are in. The more specific the better — time, location, event, and the other person's reaction can all be included.
[Guidance] Similar to a director giving acting instructions to an actor: speaking speed, breath control, pauses, accents, resonance position, timbre texture, and emotional fluctuations. It can be written in detail, and the model will act according to these "stage directions".

Example:

Role: The current head of the century-old noble Cen family. Since birth, she was adopted and raised by the gatekeeper of the ancestral temple, molded into a flawless, emotionless family totem. She has long lived in seclusion and has a strong sense of class alienation towards others.

Scene: In the shadows of the ancestral hall, she watches the man who has broken through the security cordon at all costs to find her and attempts to elope with her. She will use the coldest and most rigid class barriers to strangle both the other person and the feelings that have just sprouted but are enough to start a prairie fire within herself.

Guidance:
A cold, languid yet extremely imposing deep-voiced mature woman. Her vocal tract is very relaxed, without any sign of tension, yet exuding a bone-chilling sense of oppression.

- Speed and Pauses: Extremely slow, with each word rolling on the tip of her tongue before being uttered, carrying the casual arrogance of a superior. There are extremely long, unsettling pauses between sentences.
- Breathiness and Full Voice: Most of the time, her voice has no obvious pitch fluctuations, with a heavy and hard full voice, like a calm yet cold undercurrent. However, a very slight breathy sound must be added at certain final sounds (such as "sincerity") to reveal a hint of weariness and longing that even she herself is unaware of.
- Articulation Texture: The mixed use of literary and colloquial words bears the traces of the old era, with labiodental sounds pronounced extremely lightly but extremely clearly (such as "collision" and "cheap"), making her speech both elegant and sharp, hitting home with every word.

Director Mode is suitable for scenarios with high requirements for voice performance, such as character voiceovers, film-level content generation, etc.

Audio Tag Control

By embedding style tags and audio tags in the text, fine-grained control over speech can be directly achieved. The overall style tag comes at the beginning, and fine-grained control tags can be inserted in the middle. All tag control content is placed in the messages of the role: assistant content field.

Add a start (style) tag to the target text to specify the pronunciation style of the voice. Multiple styles can be set simultaneously by placing multiple style names within the same pair of parentheses, with no restrictions on the delimiter.

Supported bracket formats: Half-width (), full-width （）, or [] can be used.

Format Example: (Style 1 Style 2)Content to be Synthesized

The following are some recommended styles, and custom styles not listed are also supported.

Precautions

To experience a better singing style, you must add the (唱歌) tag at the very beginning of the target text, with the format: (唱歌)lyrics. Lyrics are recommended to be in Chinese for better synthesis results. The identifiers within the tags support the following values, with equivalent effects:

唱歌, sing, singing

Style Type	Style Example
Basic Emotions	Happy / Sad / Angry / Fearful / Amazed / Excited / Wronged / Calm / Indifferent
Complex Emotions	Melancholy / Relieved / Helpless / Guilty / Relieved / Jealous / Tired / Apprehensive / Emotional
Overall tone	Gentle / Cold / Lively / Serious / Lazy / Playful / Deep / Capable / Sharp
Timbre Positioning	Magnetic / Mellow / Clear / Ethereal / Innocent / Old / Sweet / Hoarse / Elegant
Character Tone	Clamp voice / Big Sister voice / Shota voice / Uncle voice / Taiwanese accent
Dialect	Northeast dialect / Sichuan dialect / Henan dialect / Cantonese
Role-playing	Sun Wukong / Lin Daiyu
Singing	singing

Example:

(Sighing)After all these years, when I walked down that street again, a part of my heart suddenly felt empty.
(Lazy)Let me sleep for five more minutes... just five minutes, really, for the last time.
(Magnetic)The night is already deep, but the city is still breathing. I'm the one accompanying you tonight. Welcome to listen to <Midnight Radio>.
(Northeastern dialect)Oh my goodness, it's so cold today! You know that wind, it's whistling like a knife, cutting into your face!
(Cantonese)This is really amazing! Once you've tasted it, you won't forget!
(singing)Forgive me for my unruly and unrestrained love for freedom throughout my life, and I'm also afraid that one day I'll fall, Oh no. Abandoning ideals, anyone can do it, so how could I be afraid that one day it'll only be you and me.

On this basis, we also support inserting [audio tag] at any position in the text. Through the [audio tag], you can perform fine-grained control over the sound, precisely adjusting tone, mood, and expression style—whether it's a whisper, a hearty laugh, or a little complaint with a touch of emotion. You can also flexibly insert breathing sounds, pauses, coughs, etc., all of which can be easily achieved. The speaking speed can also be flexibly adjusted, allowing each sentence to have its proper rhythm.

Style Type	Style Example
Speech Rate and Rhythm	Inhale / Take a deep breath / Sigh / Let out a long sigh / Pant / Hold one's breath
Emotional State	nervous / scared / excited / tired / wronged / coquettish / guilty / shocked / impatient
Speech Features	Trembling / Voice trembling / Pitch change / Cracked voice / Nasal voice / Breathiness / Hoarseness
Laughing and crying tone	Smile / Chuckle / Laugh out loud / Sneer / Sob / Whimper / Choke / Wail

Example:

(nervously, takes a deep breath) Hoo... Calm down, calm down. It's just an interview... (speaking faster, muttering) I've rehearsed my self-introduction fifty times, it should be okay. Come on, you can do it... (softly) Oh, is my tie crooked?
(extremely exhausted, listless) Master... wake me up when we get there... (sighs deeply) I'll take a little nap first. This overtime has made me feel like my soul is about to scatter.
If I had... (pauses for a moment) even if I had persisted for just one more second, would the outcome have been different? (forced smile) Oh, there are no "what ifs" anymore.
(Rapid breathing due to the cold) Hoo—hoo—This, this snow in the Greater Khingan Mountains... (cough) It can literally freeze one's bones... Don't, don't stop, keep moving, move quickly.
(raising voice and shouting) Sister! This fish is fresh! Just caught this morning! Hey! You there, stop rummaging around! If you crush it, you'll have to pay for it!

Speech Synthesis Using Built-in Voices

It comes with multiple high-quality voices and can be used directly without additional configuration. Currently, only the mimo-v2.5-tts model is supported
Supports controlling the style of synthetic speech by passing natural language instructions in the user message
Supports controlling the style of synthesized speech through audio tags

Built-in Voice List

When in use, you can set the preset timbre in {"audio": {"voice": "mimo_default"}}.

Voice Name	Voice ID	Language	Gender
MiMo-默认	mimo_default	It varies depending on the deployed cluster. The default for the China cluster is `冰糖`, and the default for other clusters is `Mia`
冰糖	冰糖	Chinese	Female
茉莉	茉莉	Chinese	Female
苏打	苏打	Chinese	Male
白桦	白桦	Chinese	Male
Mia	Mia	English	Female
Chloe	Chloe	English	Female
Milo	Milo	English	Male
Dean	Dean	English	Male