DeepInfra raises $107M Series B to scale the inference cloud — read the announcement
XiaomiMiMo/
$0.00
/ 1M characters
Automatically convert input text into natural and fluent speech output. You can generate natural and vivid speech content by configuring parameters such as speech style and voice. Automatically generate voices from text descriptions, without requiring presets or audio samples.

Input text
Text to convert to speech. Should match the tone described by `voice`.
Voice description
Natural language description of the desired voice (gender/age, texture, mood, speech speed). See the MiMo-V2.5-TTS-VoiceDesign documentation for guidance on writing effective descriptions.
You need to login to use this model
LoginSettings
ServiceTier
The service tier used for processing the request. When set to 'priority', the request will be processed with higher priority (only applies to models that support it).
TtsResponseFormat
Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.
Stream
Whether to stream the output.
Waiting for audio data... Submit request to start streaming.
Free for a limited time
Speech Synthesis (Text-to-Speech) supports automatically converting input text into natural and fluent speech output. You can generate natural and vivid speech content by configuring parameters such as speech style and voice.
Core Capabilities
Currently, three models of the MiMo-V2.5-TTS series are supported, and the model list is as follows:
| Model Name | Function | Voice | Precautions |
|---|---|---|---|
| MiMo-V2.5-TTS | Use built-in high-quality voices for speech synthesis | Use the high-quality voices from the built-in voices list | Supports singing mode; does not support voice design and voice cloning |
| MiMo-V2.5-TTS-VoiceDesign | Customize voice through text description | Automatically generate voices from text descriptions, without requiring presets or audio samples | Does not support singing mode, built-in voices, or voice cloning |
| MiMo-V2.5-TTS-VoiceClone | Replicate any voice from audio samples | Precisely replicate voices from audio samples to enable speech synthesis of any voice | Does not support singing mode, built-in voices, or voice design |
The instruction-following ability of the model is sufficient to cover the following complex controls (a single natural language instruction is sufficient to take effect):
We currently offer two control methods: natural language control and tag control. The placement of the content for both methods in messages is different:
role: user's contentrole: assistant's contentThrough natural language description, enable the model to understand and generate speech in the corresponding style. The content is placed in the messages field of role: user in the content field. You can directly describe the desired speech style in a single sentence.
Example:
Report good news to the leader in a brisk and upbeat tone, speaking at a slightly faster pace, with the uncontrollable excitement and a touch of pride after learning the results, and a bright and energetic voice.
Looking at the results of the just-solved difficult problem, couldn't help exclaiming in a self-satisfied and overjoyed manner, with a high-pitched and bright voice, a relatively fast speaking speed, and a tone full of confidence and disbelief.
With a bright and lively teenage voice, carrying the pride and playfulness after a successful prank, speaking at a relatively fast pace with light enunciation, and the tone slightly rising when emphasizing the bet.
On this basis, we also support a more complex and refined director mode — just like writing a script for actors, comprehensively depicting characters and voices from the three dimensions of character, scene, and guidance, based on which the model can generate more layered and performative voices.
Example:
Role: The current head of the century-old noble Cen family. Since birth, she was adopted and raised by the gatekeeper of the ancestral temple, molded into a flawless, emotionless family totem. She has long lived in seclusion and has a strong sense of class alienation towards others.
Scene: In the shadows of the ancestral hall, she watches the man who has broken through the security cordon at all costs to find her and attempts to elope with her. She will use the coldest and most rigid class barriers to strangle both the other person and the feelings that have just sprouted but are enough to start a prairie fire within herself.
Guidance:
A cold, languid yet extremely imposing deep-voiced mature woman. Her vocal tract is very relaxed, without any sign of tension, yet exuding a bone-chilling sense of oppression.
- Speed and Pauses: Extremely slow, with each word rolling on the tip of her tongue before being uttered, carrying the casual arrogance of a superior. There are extremely long, unsettling pauses between sentences.
- Breathiness and Full Voice: Most of the time, her voice has no obvious pitch fluctuations, with a heavy and hard full voice, like a calm yet cold undercurrent. However, a very slight breathy sound must be added at certain final sounds (such as "sincerity") to reveal a hint of weariness and longing that even she herself is unaware of.
- Articulation Texture: The mixed use of literary and colloquial words bears the traces of the old era, with labiodental sounds pronounced extremely lightly but extremely clearly (such as "collision" and "cheap"), making her speech both elegant and sharp, hitting home with every word.
Director Mode is suitable for scenarios with high requirements for voice performance, such as character voiceovers, film-level content generation, etc.
By embedding style tags and audio tags in the text, fine-grained control over speech can be directly achieved. The overall style tag comes at the beginning, and fine-grained control tags can be inserted in the middle. All tag control content is placed in the messages of the role: assistant content field.
Add a start (style) tag to the target text to specify the pronunciation style of the voice. Multiple styles can be set simultaneously by placing multiple style names within the same pair of parentheses, with no restrictions on the delimiter.
Supported bracket formats: Half-width (), full-width (), or [] can be used.
Format Example: (Style 1 Style 2)Content to be Synthesized
The following are some recommended styles, and custom styles not listed are also supported.
Precautions
- To experience a better singing style, you must add the
(唱歌)tag at the very beginning of the target text, with the format:(唱歌)lyrics.Lyricsare recommended to be in Chinese for better synthesis results. The identifiers within the tags support the following values, with equivalent effects:唱歌,sing,singing
| Style Type | Style Example |
|---|---|
| Basic Emotions | Happy / Sad / Angry / Fearful / Amazed / Excited / Wronged / Calm / Indifferent |
| Complex Emotions | Melancholy / Relieved / Helpless / Guilty / Relieved / Jealous / Tired / Apprehensive / Emotional |
| Overall tone | Gentle / Cold / Lively / Serious / Lazy / Playful / Deep / Capable / Sharp |
| Timbre Positioning | Magnetic / Mellow / Clear / Ethereal / Innocent / Old / Sweet / Hoarse / Elegant |
| Character Tone | Clamp voice / Big Sister voice / Shota voice / Uncle voice / Taiwanese accent |
| Dialect | Northeast dialect / Sichuan dialect / Henan dialect / Cantonese |
| Role-playing | Sun Wukong / Lin Daiyu |
| Singing | singing |
Example:
(Sighing)After all these years, when I walked down that street again, a part of my heart suddenly felt empty.(Lazy)Let me sleep for five more minutes... just five minutes, really, for the last time.(Magnetic)The night is already deep, but the city is still breathing. I'm the one accompanying you tonight. Welcome to listen to <Midnight Radio>.(Northeastern dialect)Oh my goodness, it's so cold today! You know that wind, it's whistling like a knife, cutting into your face!(Cantonese)This is really amazing! Once you've tasted it, you won't forget!(singing)Forgive me for my unruly and unrestrained love for freedom throughout my life, and I'm also afraid that one day I'll fall, Oh no. Abandoning ideals, anyone can do it, so how could I be afraid that one day it'll only be you and me.On this basis, we also support inserting [audio tag] at any position in the text. Through the [audio tag], you can perform fine-grained control over the sound, precisely adjusting tone, mood, and expression style—whether it's a whisper, a hearty laugh, or a little complaint with a touch of emotion. You can also flexibly insert breathing sounds, pauses, coughs, etc., all of which can be easily achieved. The speaking speed can also be flexibly adjusted, allowing each sentence to have its proper rhythm.
| Style Type | Style Example |
|---|---|
| Speech Rate and Rhythm | Inhale / Take a deep breath / Sigh / Let out a long sigh / Pant / Hold one's breath |
| Emotional State | nervous / scared / excited / tired / wronged / coquettish / guilty / shocked / impatient |
| Speech Features | Trembling / Voice trembling / Pitch change / Cracked voice / Nasal voice / Breathiness / Hoarseness |
| Laughing and crying tone | Smile / Chuckle / Laugh out loud / Sneer / Sob / Whimper / Choke / Wail |
Example:
There is no need to provide an audio file. Simply add voice description text to the message with the role of user, and a customized voice can be generated. Currently, only the mimo-v2.5-tts-voicedesign model is supported.
When using the mimo-v2.5-tts-voicedesign model, the text in the user message is the voice design description. The more specific and vivid the description, the closer the generated voice will be to the expected one.
A good voice description usually covers the following multiple dimensions (not necessarily comprehensive):
| Dimension | Example |
|---|---|
| Gender and Age | "young woman in her mid-20s", "middle-aged man in his 50s" |
| Voice / Texture | "deep and gravelly", "silky, mellow, and magnetic" |
| Mood / Tone | "warm and confident", "gentle but with a hint of weariness" |
| Speech speed / Rhythm | "slow and deliberate", "speaking at an extremely fast pace, like a machine gun." |
The following dimensions can be optionally added to increase richness:
Concise descriptive -- quickly outline the sound profile using keywords or a single sentence
Heavy Russian accent, gruff middle-aged male, blunt and matter-of-fact.
Professional Descriptive -- Three-dimensional portrayal of sound through scenarios, character design, or multi-dimensional details
Young female, extreme close-up with a binaural, ear-to-ear ASMR feel. Audible breathing, subtle swallowing, and soft natural lip sounds. She speaks very slowly, creating a deeply relaxing and immersive experience.
An elderly gentleman, speaking Mandarin with a northern accent, his speech slow and steady, his voice slightly hoarse and weathered, as if an old and seasoned grandfather were telling a story, full of the wisdom of years.
assistant message should match the voice tone description to achieve the best results. For example, pair a goodnight monologue with a "gentle and soothing female voice" instead of a passionate sports commentary. It is recommended to use LLM to automatically generate matching synthetic text based on your voice tone description; on the Studio page, you can directly click the "Generate Text" button after entering the voice tone description.© 2026 DeepInfra. All rights reserved.