inworld-ai/
$5.00 / 1M characters
Fast multilingual text-to-speech model by Inworld AI with 130+ preset voices across 15 languages. Supports voice cloning, word-level timestamps, and streaming. Optimized for low-latency applications with <130ms time-to-first-audio.

Input text: Text to convert to speech.
Settings
ServiceTier: The service tier used for processing the request. When set to 'priority', the request is processed with higher priority (only applies to models that support it).
Voice: Preset voice name (Ashley, Diego, etc.) or a voice_id from /v1/voices/add for voice cloning.
TtsResponseFormat: The desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.
Speaking rate: Speaking rate of the speech (Default: 1, 0.5 ≤ speaking_rate ≤ 1.5).
Temperature: Controls variability of the speech (Default: empty, 0 ≤ temperature ≤ 2).
Sample rate: Sample rate for the output audio (Default: 24000).
Return timestamps: Whether to return word-level timestamps.
Stream: Whether to stream the output.
Inworld TTS 1.5 Mini is a fast, lightweight text-to-speech model developed by Inworld AI. It delivers natural, expressive speech across 15 languages with 130+ preset voices and support for instant voice cloning.
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | string | required | Text to synthesize (up to 500,000 characters) |
| voice | string | "Ashley" | Voice name from 130+ presets, or a cloned voice ID |
| output_format | string | "mp3" | Output audio format: mp3, wav, opus, pcm |
| speaking_rate | float | 1.0 | Speed of speech (0.5–1.5) |
| temperature | float | 1.1 | Controls variability in synthesis (0–2). Higher values produce more expressive speech |
| sample_rate | int | 24000 | Audio sample rate: 16000, 24000, or 48000 Hz |
| return_timestamps | bool | false | Return word-level timestamps in the response |
| speaker_audio | binary | none | Reference audio for voice cloning (5–15 seconds) |
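The limits in the parameter table can be checked on the client before a request is sent. The following is a minimal sketch in Python, not the service's own client: `build_tts_payload` and the constant names are hypothetical, the allowed values and defaults are copied from the table, and the sketch assumes the parameters are sent as a flat JSON object, which the page does not confirm. `speaker_audio` is left out because the page does not say how the binary reference audio is encoded.

```python
from typing import Any, Dict

# Limits copied from the parameter table; the names are illustrative only.
ALLOWED_FORMATS = {"mp3", "wav", "opus", "pcm"}
ALLOWED_SAMPLE_RATES = {16000, 24000, 48000}
MAX_TEXT_CHARS = 500_000


def build_tts_payload(
    text: str,
    voice: str = "Ashley",
    output_format: str = "mp3",
    speaking_rate: float = 1.0,
    temperature: float = 1.1,
    sample_rate: int = 24000,
    return_timestamps: bool = False,
) -> Dict[str, Any]:
    """Hypothetical helper: validate arguments and return a JSON-ready dict."""
    if not text:
        raise ValueError("text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(ALLOWED_FORMATS)}")
    if not 0.5 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0.5 and 1.5")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    if sample_rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {sorted(ALLOWED_SAMPLE_RATES)}")
    # Assumed request shape: one key per table row (not confirmed by the page).
    return {
        "text": text,
        "voice": voice,
        "output_format": output_format,
        "speaking_rate": speaking_rate,
        "temperature": temperature,
        "sample_rate": sample_rate,
        "return_timestamps": return_timestamps,
    }


if __name__ == "__main__":
    print(build_tts_payload("Hello from Inworld TTS.", voice="Diego"))
```

Rejecting out-of-range values locally avoids spending a round trip on a request the service would presumably refuse; how the service itself reports such errors is not documented here.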
Preset voices include Ashley, Blake, Dennis, Diego, Dominus, Elizabeth, Hades, Luna, Pixie, and 120+ more across all supported languages.
Pricing: $5 per 1 million input characters.
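As a rough illustration of the rate, the cost of a job scales linearly with the number of input characters. The sketch below is a hypothetical helper (`estimate_cost_usd` is not part of any API) and assumes that billed characters correspond to the character count the caller measures, for example Python's `len()`; the page does not say how characters are counted.

```python
# Rate taken from the pricing line: $5 per 1,000,000 input characters.
PRICE_USD_PER_MILLION_CHARS = 5.0


def estimate_cost_usd(num_chars: int) -> float:
    """Hypothetical helper: estimated cost in USD for num_chars input characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars * PRICE_USD_PER_MILLION_CHARS / 1_000_000


if __name__ == "__main__":
    # 200,000 characters at $5 per million comes to $1.00.
    print(f"${estimate_cost_usd(200_000):.2f}")
```

At this rate, the 500,000-character maximum for a single `text` value would cost about $2.50.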
© 2026 Deep Infra. All rights reserved.