inworld-ai/realtime-tts-2

Partner

$35.00 / 1M characters
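
As a rough illustration of the listed price, the sketch below estimates the cost of synthesizing a piece of text. It assumes billing is linear in the number of characters and that a character is one Python string element; the page does not say how characters are counted (for example, whether steering instructions in brackets are billed), so treat the result as an estimate only. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# Listed price in USD per one million characters (taken from this page).
PRICE_PER_MILLION_CHARS = Decimal("35.00")


def estimate_cost(text: str) -> Decimal:
    """Estimate the USD cost of synthesizing `text`.

    Assumption: cost scales linearly with len(text). The provider's
    actual character-counting rules are not stated on this page.
    """
    return PRICE_PER_MILLION_CHARS * len(text) / Decimal(1_000_000)


if __name__ == "__main__":
    sample = "Hello! " * 100  # 700 characters
    print(f"{len(sample)} characters: about ${estimate_cost(sample)}")
```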

Realtime TTS 2.0 is a low-latency text-to-speech model with natural language steering, allowing you to control tone and emotion directly in the prompt (e.g., “[be happy and upbeat] Hello!”). It supports cross-lingual voices and multiple languages, enabling the same voice to speak consistently across different languages. This is an early access preview ahead of full launch, with ongoing improvements to voice quality and steering.
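
The bracketed steering prefix shown in the description can be assembled with a small helper. This is a minimal sketch: the helper name `with_steering` is hypothetical, and the only format it relies on is the one from the example above (an instruction in square brackets, a space, then the text to speak). Rejecting brackets inside the instruction is a precaution of this sketch, not a documented rule.

```python
def with_steering(instruction: str, text: str) -> str:
    """Prefix `text` with a bracketed steering instruction.

    Follows the pattern from the model description:
    "[be happy and upbeat] Hello!"
    """
    instruction = instruction.strip()
    if not instruction:
        # No instruction given: send the text unchanged.
        return text
    if "[" in instruction or "]" in instruction:
        # Assumption: nested brackets could confuse the parser, so refuse them.
        raise ValueError("instruction must not contain square brackets")
    return f"[{instruction}] {text}"


if __name__ == "__main__":
    print(with_steering("be happy and upbeat", "Hello!"))
```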

Public

Input

Input text

Text to convert to speech

Settings

ServiceTier

The service tier used for processing the request. When set to 'priority', the request will be processed with higher priority (only applies to models that support it).

Voice

Preset voice name (Ashley, Diego, etc.) or a voice_id from /v1/voices/add for voice cloning.

TtsResponseFormat

Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.

Speaking rate

Speed of the generated speech (Default: 1, 0.5 ≤ speaking_rate ≤ 1.5)

Temperature

Controls the variability of the generated speech (Default: empty, 0 ≤ temperature ≤ 2)

Sample rate

Sample rate for the output audio (Default: 24000)

Return timestamps

Whether to return word-level timestamps

Language

Language hint (e.g. "AUTO" or a specific code like "EN_US"). Supported by realtime-tts-2; ignored by 1.5 models. (Default: empty)

Stream

Whether to stream the output
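
The settings above can be checked on the client before a request is sent. The sketch below builds a settings dictionary and enforces only the limits stated on this page (the five output formats, 0.5 ≤ speaking_rate ≤ 1.5, 0 ≤ temperature ≤ 2). Apart from `speaking_rate` and `temperature`, the field names (`input`, `voice`, `response_format`, `sample_rate`, `return_timestamps`, `language`, `stream`) and the function `build_tts_settings` are assumptions made for illustration; the real request schema and endpoint are not shown here and should be taken from the API documentation.

```python
from typing import Any, Optional

# Output formats listed under TtsResponseFormat on this page.
SUPPORTED_FORMATS = {"mp3", "opus", "flac", "wav", "pcm"}


def build_tts_settings(
    text: str,
    voice: str,
    response_format: str,
    speaking_rate: Optional[float] = None,
    temperature: Optional[float] = None,
    sample_rate: Optional[int] = None,
    return_timestamps: Optional[bool] = None,
    language: Optional[str] = None,
    stream: Optional[bool] = None,
) -> dict[str, Any]:
    """Validate settings and return a dict with hypothetical field names."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")
    if speaking_rate is not None and not 0.5 <= speaking_rate <= 1.5:
        raise ValueError("speaking_rate must be between 0.5 and 1.5")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")

    settings: dict[str, Any] = {
        "input": text,
        "voice": voice,  # preset name or a cloned voice_id
        "response_format": response_format,
    }
    optional = {
        "speaking_rate": speaking_rate,
        "temperature": temperature,
        "sample_rate": sample_rate,
        "return_timestamps": return_timestamps,
        "language": language,
        "stream": stream,
    }
    # Leave unset options out so the service applies its own defaults.
    settings.update({k: v for k, v in optional.items() if v is not None})
    return settings


if __name__ == "__main__":
    print(build_tts_settings("Hello!", voice="Ashley",
                             response_format="wav", speaking_rate=1.2))
```

Omitting unset options, rather than filling in guessed defaults, keeps the sketch from overriding whatever defaults the service applies.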
