We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

XiaomiMiMo/

MiMo-V2.5-tts-voicedesign

Partner

$0.00

/ 1M characters

Automatically convert input text into natural and fluent speech output. You can generate natural and vivid speech content by configuring parameters such as speech style and voice. Automatically generate voices from text descriptions, without requiring presets or audio samples.

Public

XiaomiMiMo/MiMo-V2.5-tts-voicedesign cover image

api versions voice

Input

Input text

Text to convert to speech. Should match the tone described by `voice`.

Voice description

Natural language description of the desired voice (gender/age, texture, mood, speech speed). See the MiMo-V2.5-TTS-VoiceDesign documentation for guidance on writing effective descriptions.

You need to log in to use this model

Settings

ServiceTier

The service tier used for processing the request. 'priority' processes the request with higher priority (premium rate); 'flex' processes it at lower priority for a discount, served only when spare capacity exists and may be retried/timed out under load. Both apply only to models that support the respective tier.

TtsResponseFormat

Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.

Stream

Whether to stream the output.

Output

Waiting for audio data... Submit request to start streaming.

Model Information

Free for a limited time

MiMo-V2.5-TTS Series

Speech Synthesis (Text-to-Speech) supports automatically converting input text into natural and fluent speech output. You can generate natural and vivid speech content by configuring parameters such as speech style and voice.

Core Capabilities

Out-of-the-box built-in voices: A variety of high-quality built-in voices are available for quick use without additional configuration.
Voice design and cloning: Supports voice design via text description, or replication of arbitrary voices based on audio samples.
Diverse speech styles: Supports control over speed, emotion, role-play, dialects and other styles, for more vivid and natural speech expression.

List of Supported Models

Currently, three models of the MiMo-V2.5-TTS series are supported, and the model list is as follows:

Model Name	Function	Voice	Precautions
MiMo-V2.5-TTS	Use built-in high-quality voices for speech synthesis	Use the high-quality voices from the built-in voices list	Supports singing mode; does not support voice design and voice cloning
MiMo-V2.5-TTS-VoiceDesign	Customize voice through text description	Automatically generate voices from text descriptions, without requiring presets or audio samples	Does not support singing mode, built-in voices, or voice cloning
MiMo-V2.5-TTS-VoiceClone	Replicate any voice from audio samples	Precisely replicate voices from audio samples to enable speech synthesis of any voice	Does not support singing mode, built-in voices, or voice design

Style Control

The instruction-following ability of the model is sufficient to cover the following complex controls (a single natural language instruction is sufficient to take effect):

Multi-style Switching: A single character completes the style transition from announcement → whisper → roar within the same voice segment, with a natural and unobtrusive transition.
Multi-emotion Mixing: Supports complex emotions such as "repressed anger", "smile with a sob", "gentle but tired", "gentleness in mania", etc., rather than only allowing the selection of a single emotion.
Multi-granularity control: From paragraph level (overall tone) → sentence level (rhythm) → word level (stress) → character granularity (choking, dragging, or breathy sound of a specific character), all can be specified in the instruction.

We currently offer two control methods: natural language control and tag control. The placement of the content for both methods in messages is different:

Natural Language Control → Placed in role: user's content
Audio Tag Control → Placed in role: assistant's content

Natural Language Control

Through natural language description, enable the model to understand and generate speech in the corresponding style. The content is placed in the messages field of role: user in the content field. You can directly describe the desired speech style in a single sentence.

Example:

Report good news to the leader in a brisk and upbeat tone, speaking at a slightly faster pace, with the uncontrollable excitement and a touch of pride after learning the results, and a bright and energetic voice.

Looking at the results of the just-solved difficult problem, couldn't help exclaiming in a self-satisfied and overjoyed manner, with a high-pitched and bright voice, a relatively fast speaking speed, and a tone full of confidence and disbelief.

With a bright and lively teenage voice, carrying the pride and playfulness after a successful prank, speaking at a relatively fast pace with light enunciation, and the tone slightly rising when emphasizing the bet.

On this basis, we also support a more complex and refined director mode — just like writing a script for actors, comprehensively depicting characters and voices from the three dimensions of character, scene, and guidance, based on which the model can generate more layered and performative voices.

[Character] Clearly describe the character's identity, personality traits, physical appearance and speaking habits.
[Scene] Describe what is happening at this moment, who you are talking to, and what emotional state you are in. The more specific the better — time, location, event, and the other person's reaction can all be included.
[Guidance] Similar to a director giving acting instructions to an actor: speaking speed, breath control, pauses, accents, resonance position, timbre texture, and emotional fluctuations. It can be written in detail, and the model will act according to these "stage directions".

Example:

Role: The current head of the century-old noble Cen family. Since birth, she was adopted and raised by the gatekeeper of the ancestral temple, molded into a flawless, emotionless family totem. She has long lived in seclusion and has a strong sense of class alienation towards others.

Scene: In the shadows of the ancestral hall, she watches the man who has broken through the security cordon at all costs to find her and attempts to elope with her. She will use the coldest and most rigid class barriers to strangle both the other person and the feelings that have just sprouted but are enough to start a prairie fire within herself.

Guidance:
A cold, languid yet extremely imposing deep-voiced mature woman. Her vocal tract is very relaxed, without any sign of tension, yet exuding a bone-chilling sense of oppression.

- Speed and Pauses: Extremely slow, with each word rolling on the tip of her tongue before being uttered, carrying the casual arrogance of a superior. There are extremely long, unsettling pauses between sentences.
- Breathiness and Full Voice: Most of the time, her voice has no obvious pitch fluctuations, with a heavy and hard full voice, like a calm yet cold undercurrent. However, a very slight breathy sound must be added at certain final sounds (such as "sincerity") to reveal a hint of weariness and longing that even she herself is unaware of.
- Articulation Texture: The mixed use of literary and colloquial words bears the traces of the old era, with labiodental sounds pronounced extremely lightly but extremely clearly (such as "collision" and "cheap"), making her speech both elegant and sharp, hitting home with every word.

Director Mode is suitable for scenarios with high requirements for voice performance, such as character voiceovers, film-level content generation, etc.

Audio Tag Control

By embedding style tags and audio tags in the text, fine-grained control over speech can be directly achieved. The overall style tag comes at the beginning, and fine-grained control tags can be inserted in the middle. All tag control content is placed in the messages of the role: assistant content field.

Add a start (style) tag to the target text to specify the pronunciation style of the voice. Multiple styles can be set simultaneously by placing multiple style names within the same pair of parentheses, with no restrictions on the delimiter.

Supported bracket formats: Half-width (), full-width （）, or [] can be used.

Format Example: (Style 1 Style 2)Content to be Synthesized

The following are some recommended styles, and custom styles not listed are also supported.

Precautions

To experience a better singing style, you must add the (唱歌) tag at the very beginning of the target text, with the format: (唱歌)lyrics. Lyrics are recommended to be in Chinese for better synthesis results. The identifiers within the tags support the following values, with equivalent effects:

唱歌, sing, singing

Style Type	Style Example
Basic Emotions	Happy / Sad / Angry / Fearful / Amazed / Excited / Wronged / Calm / Indifferent
Complex Emotions	Melancholy / Relieved / Helpless / Guilty / Relieved / Jealous / Tired / Apprehensive / Emotional
Overall tone	Gentle / Cold / Lively / Serious / Lazy / Playful / Deep / Capable / Sharp
Timbre Positioning	Magnetic / Mellow / Clear / Ethereal / Innocent / Old / Sweet / Hoarse / Elegant
Character Tone	Clamp voice / Big Sister voice / Shota voice / Uncle voice / Taiwanese accent
Dialect	Northeast dialect / Sichuan dialect / Henan dialect / Cantonese
Role-playing	Sun Wukong / Lin Daiyu
Singing	singing

Example:

(Sighing)After all these years, when I walked down that street again, a part of my heart suddenly felt empty.
(Lazy)Let me sleep for five more minutes... just five minutes, really, for the last time.
(Magnetic)The night is already deep, but the city is still breathing. I'm the one accompanying you tonight. Welcome to listen to <Midnight Radio>.
(Northeastern dialect)Oh my goodness, it's so cold today! You know that wind, it's whistling like a knife, cutting into your face!
(Cantonese)This is really amazing! Once you've tasted it, you won't forget!
(singing)Forgive me for my unruly and unrestrained love for freedom throughout my life, and I'm also afraid that one day I'll fall, Oh no. Abandoning ideals, anyone can do it, so how could I be afraid that one day it'll only be you and me.

On this basis, we also support inserting [audio tag] at any position in the text. Through the [audio tag], you can perform fine-grained control over the sound, precisely adjusting tone, mood, and expression style—whether it's a whisper, a hearty laugh, or a little complaint with a touch of emotion. You can also flexibly insert breathing sounds, pauses, coughs, etc., all of which can be easily achieved. The speaking speed can also be flexibly adjusted, allowing each sentence to have its proper rhythm.

Style Type	Style Example
Speech Rate and Rhythm	Inhale / Take a deep breath / Sigh / Let out a long sigh / Pant / Hold one's breath
Emotional State	nervous / scared / excited / tired / wronged / coquettish / guilty / shocked / impatient
Speech Features	Trembling / Voice trembling / Pitch change / Cracked voice / Nasal voice / Breathiness / Hoarseness
Laughing and crying tone	Smile / Chuckle / Laugh out loud / Sneer / Sob / Whimper / Choke / Wail

Example:

(nervously, takes a deep breath) Hoo... Calm down, calm down. It's just an interview... (speaking faster, muttering) I've rehearsed my self-introduction fifty times, it should be okay. Come on, you can do it... (softly) Oh, is my tie crooked?
(extremely exhausted, listless) Master... wake me up when we get there... (sighs deeply) I'll take a little nap first. This overtime has made me feel like my soul is about to scatter.
If I had... (pauses for a moment) even if I had persisted for just one more second, would the outcome have been different? (forced smile) Oh, there are no "what ifs" anymore.
(Rapid breathing due to the cold) Hoo—hoo—This, this snow in the Greater Khingan Mountains... (cough) It can literally freeze one's bones... Don't, don't stop, keep moving, move quickly.
(raising voice and shouting) Sister! This fish is fresh! Just caught this morning! Hey! You there, stop rummaging around! If you crush it, you'll have to pay for it!

Speech Synthesis Using Voice Design

There is no need to provide an audio file. Simply add voice description text to the message with the role of user, and a customized voice can be generated. Currently, only the mimo-v2.5-tts-voicedesign model is supported.

How to Write a Good Voice Design Prompt

When using the mimo-v2.5-tts-voicedesign model, the text in the user message is the voice design description. The more specific and vivid the description, the closer the generated voice will be to the expected one.

Key Dimension

A good voice description usually covers the following multiple dimensions (not necessarily comprehensive):

Dimension	Example
Gender and Age	"young woman in her mid-20s", "middle-aged man in his 50s"
Voice / Texture	"deep and gravelly", "silky, mellow, and magnetic"
Mood / Tone	"warm and confident", "gentle but with a hint of weariness"
Speech speed / Rhythm	"slow and deliberate", "speaking at an extremely fast pace, like a machine gun."

The following dimensions can be optionally added to increase richness:

Role / Character: narrator, podcast host, storyteller, late-night radio DJ
Speaking style: casual and colloquial, seriously, lowering one's voice as if plotting
Scene description: narrating a nature documentary, during a roadshow for investors
Era reference: 1940s film noir, dubbed voices of translated films from the 1980s

Writing Suggestions

Concise descriptive -- quickly outline the sound profile using keywords or a single sentence

Heavy Russian accent, gruff middle-aged male, blunt and matter-of-fact.

Professional Descriptive -- Three-dimensional portrayal of sound through scenarios, character design, or multi-dimensional details

Young female, extreme close-up with a binaural, ear-to-ear ASMR feel. Audible breathing, subtle swallowing, and soft natural lip sounds. She speaks very slowly, creating a deeply relaxing and immersive experience.

An elderly gentleman, speaking Mandarin with a northern accent, his speech slow and steady, his voice slightly hoarse and weathered, as if an old and seasoned grandfather were telling a story, full of the wisdom of years.

Precautions

Length: 1-4 sentences are sufficient; there's no need to write a long text. Clearly describing the core features is more important than piling up dimensions
Avoid conflicts: Do not simultaneously request contradictory characteristics (e.g., "innocent childish voice + CEO aura")
Avoid using audio quality effect terms: Do not write descriptions related to post-processing such as reverb, echo, EQ, compression, etc
Avoid vague words: Do not use descriptions lacking specific references such as "ordinary," "normal," or "foreign"
Both Chinese and English are supported: the model supports both Chinese and English voice timbre descriptions, so choose the language in which you can express most precisely
Synthetic text should match the voice tone: The synthetic text in the assistant message should match the voice tone description to achieve the best results. For example, pair a goodnight monologue with a "gentle and soothing female voice" instead of a passionate sports commentary. It is recommended to use LLM to automatically generate matching synthetic text based on your voice tone description; on the Studio page, you can directly click the "Generate Text" button after entering the voice tone description.