
inworld-ai/
$35.00 / 1M characters
Realtime TTS 2.0 is a low-latency text-to-speech model with natural language steering, allowing you to control tone and emotion directly in the prompt (e.g., “[be happy and upbeat] Hello!”). It supports cross-lingual voices and multiple languages, enabling the same voice to speak consistently across different languages. This is an early access preview ahead of full launch, with ongoing improvements to voice quality and steering.
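As a rough illustration of the steering syntax and the listed price, the sketch below builds a bracketed steering prompt like the example above and estimates cost at $35.00 per 1M characters. The helper names `steer` and `estimate_cost_usd` are hypothetical, and the sketch assumes every character of the input text, including the steering tag, is billed, which this page does not state.

```python
# Hypothetical helpers; not part of any official SDK.
PRICE_PER_MILLION_CHARS_USD = 35.00  # price listed on this page


def steer(instruction: str, text: str) -> str:
    """Prefix text with a bracketed natural-language steering instruction."""
    instruction = instruction.strip()
    if not instruction:
        return text  # nothing to steer with, send the text unchanged
    if "[" in instruction or "]" in instruction:
        raise ValueError("instruction must not contain square brackets")
    return f"[{instruction}] {text}"


def estimate_cost_usd(text: str) -> float:
    """Estimate cost, ASSUMING billing counts every input character."""
    return len(text) * PRICE_PER_MILLION_CHARS_USD / 1_000_000


prompt = steer("be happy and upbeat", "Hello!")
print(prompt)
print(f"Estimated cost: ${estimate_cost_usd(prompt):.6f}")
```

Keeping the steering tag separate from the spoken text makes it easy to reuse one instruction across many utterances.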

Input text
Text to convert to speech
ServiceTier
The service tier used for processing the request. When set to 'priority', the request will be processed with higher priority (only applies to models that support it).
Voice
Preset voice name (Ashley, Diego, etc.) or a voice_id from /v1/voices/add for voice cloning.
TtsResponseFormat
Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.
Speaking rate
Speaking rate of the speech (Default: 1, 0.5 ≤ speaking_rate ≤ 1.5)
Temperature
Temperature controls variability of the speech (Default: empty, 0 ≤ temperature ≤ 2)
Sample rate
Sample rate for the output audio (Default: 24000)
Return timestamps
Whether to return word-level timestamps
Language
Language hint (e.g. "AUTO" or a specific code like "EN_US"). Supported by realtime-tts-2; ignored by 1.5 models. (Default: empty)
Stream
Whether to stream the output
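The parameters above can be collected and range-checked before a request is sent. The following is a minimal sketch, not the official client: only `speaking_rate` and `temperature` are named on this page, so every other key (`text`, `voice`, `response_format`, `sample_rate`, `return_timestamps`, `language`, `stream`, `service_tier`) is an assumed field name, and the `mp3` default is an arbitrary choice. No endpoint is called; check the API reference for the real request schema.

```python
from typing import Any, Dict, Optional

# Formats listed on this page.
ALLOWED_FORMATS = {"mp3", "opus", "flac", "wav", "pcm"}


def build_tts_request(
    text: str,
    voice: str,                          # preset name (e.g. "Ashley") or a cloned voice_id
    response_format: str = "mp3",        # ASSUMED default; the page states none
    speaking_rate: float = 1.0,          # documented default 1, range 0.5 to 1.5
    temperature: Optional[float] = None, # documented default empty, range 0 to 2
    sample_rate: int = 24000,            # documented default
    return_timestamps: bool = False,
    language: Optional[str] = None,      # e.g. "AUTO" or "EN_US"; default empty
    stream: bool = False,
    service_tier: Optional[str] = None,  # e.g. "priority" where supported
) -> Dict[str, Any]:
    """Validate documented ranges and return a payload dict (key names assumed)."""
    if not text:
        raise ValueError("text must not be empty")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"response_format must be one of {sorted(ALLOWED_FORMATS)}")
    if not 0.5 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must satisfy 0.5 <= speaking_rate <= 1.5")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must satisfy 0 <= temperature <= 2")

    payload: Dict[str, Any] = {
        "text": text,
        "voice": voice,
        "response_format": response_format,
        "speaking_rate": speaking_rate,
        "sample_rate": sample_rate,
        "return_timestamps": return_timestamps,
        "stream": stream,
    }
    # Fields whose documented default is "empty" are omitted when unset.
    if temperature is not None:
        payload["temperature"] = temperature
    if language is not None:
        payload["language"] = language
    if service_tier is not None:
        payload["service_tier"] = service_tier
    return payload


print(build_tts_request("[be happy and upbeat] Hello!", voice="Ashley", language="EN_US"))
```

Validating client-side surfaces out-of-range values such as `speaking_rate=2.0` before any characters are billed.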
© 2026 DeepInfra. All rights reserved.