
inworld-ai/
Pricing: $35.00 per 1M characters
Realtime TTS 2.0 is a low-latency text-to-speech model with natural language steering, which lets you control tone and emotion directly in the prompt (e.g., “[be happy and upbeat] Hello!”). It supports multiple languages and cross-lingual voices, so the same voice can speak consistently across different languages. This is an early access preview ahead of the full launch, with ongoing improvements to voice quality and steering.

DeepInfra supports custom voices.
The following curl command creates a voice from an audio sample:
curl -X POST "https://api.deepinfra.com/v1/voices/add" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F "audio=@hello.wav" \
  -F "name=John Doe" \
  -F "description=John Doe's voice"
The response will look similar to the following:
{
  "user_id": "gh:10000000",
  "voice_id": "abcd1234abcd1234abcd",
  "name": "John Doe",
  "description": "John Doe's voice",
  "created_at": 1723851387,
  "updated_at": 1723851387
}
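In a script, it can be handy to capture the voice_id field from that response for later use. The following is a minimal sketch, assuming the response has the flat shape shown above. The helper name extract_voice_id is hypothetical (not part of any DeepInfra tooling), and the sed pattern is a simple stand-in for a real JSON parser.

```shell
# Hypothetical helper: reads a JSON response on stdin and prints the
# value of its "voice_id" field. It relies on sed, so it only handles
# the simple response shape shown above (no escaped quotes in the value).
extract_voice_id() {
  sed -n 's/.*"voice_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Demonstration on a saved copy of the sample response (no network call).
sample_response='{
  "user_id": "gh:10000000",
  "voice_id": "abcd1234abcd1234abcd",
  "name": "John Doe"
}'
voice_id=$(printf '%s\n' "$sample_response" | extract_voice_id)
echo "$voice_id"
```

To use it with a live request, the output of the curl command above (run with -s to silence the progress meter) could be piped into extract_voice_id in the same way.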