Price: $20.00 per 1M characters
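As a rough illustration of the listed price, the sketch below estimates the cost of synthesizing a piece of text. It assumes billing is strictly linear in the character count, with no minimum charge or rounding, and that a "character" corresponds to one Python string element (a Unicode code point). Neither assumption is confirmed by this page, and the function name is hypothetical.

```python
from decimal import Decimal

# Listed price from this page: $20.00 per 1M characters.
PRICE_PER_MILLION_CHARS_USD = Decimal("20.00")


def estimate_cost_usd(text: str) -> Decimal:
    """Estimate synthesis cost, assuming purely linear per-character billing.

    Assumption: len(text) (Unicode code points) matches how characters
    are counted for billing. The actual counting rule may differ.
    """
    return PRICE_PER_MILLION_CHARS_USD * len(text) / Decimal(1_000_000)


if __name__ == "__main__":
    sample = "Hello from Qwen3-TTS!"
    print(f"{len(sample)} characters cost about ${estimate_cost_usd(sample)}")
```

Decimal is used instead of float so that small per-request amounts do not pick up binary rounding noise when summed.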
Qwen3-TTS is an advanced text-to-speech model by Alibaba's Qwen team, delivering stable, expressive, and low-latency speech generation across 10 languages.

Key capabilities:
- 9 preset voices: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, and Sohee, covering diverse genders, ages, and accents
- Voice cloning: clone any voice from a short (~3s) audio sample via the voice_id parameter
- Instruction control: adjust tone, emotion, and speaking style with natural language (e.g. "speak slowly and calmly", "excited tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming with ~97ms first-byte latency
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on a 1.7B-parameter architecture that uses discrete multi-codebook language modeling for end-to-end speech synthesis without cascading errors. It uses a custom 12Hz acoustic tokenizer that preserves paralinguistic information and environmental audio details.

DeepInfra supports custom voices.
The following curl command creates a custom voice from an audio file:
curl -X POST "https://api.deepinfra.com/v1/voices/add" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F "audio=@hello.wav" \
-F "name=John Doe" \
-F "description=John Doe's voice"
The command returns a response similar to:
{
  "user_id": "gh:10000000",
  "voice_id": "abcd1234abcd1234abcd",
  "name": "John Doe",
  "description": "John Doe's voice",
  "created_at": 1723851387,
  "updated_at": 1723851387
}
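For readers who prefer Python, the following is a minimal sketch of the same upload using only the standard library, plus a helper that reads the response. The endpoint URL, form field names, and response fields are taken from the curl example and sample response above. The function names are hypothetical, the audio part's Content-Type of application/octet-stream is an assumption, and treating created_at as a Unix timestamp in seconds (UTC) is inferred from the sample value rather than documented here.

```python
import json
import urllib.request
import uuid
from datetime import datetime, timezone

VOICES_ADD_URL = "https://api.deepinfra.com/v1/voices/add"  # from the curl example

# Sample response copied from the text above.
SAMPLE_RESPONSE = """{
  "user_id": "gh:10000000",
  "voice_id": "abcd1234abcd1234abcd",
  "name": "John Doe",
  "description": "John Doe's voice",
  "created_at": 1723851387,
  "updated_at": 1723851387
}"""


def build_add_voice_request(audio_bytes, filename, name, description, token):
    """Build (but do not send) a multipart/form-data POST for the voice upload."""
    boundary = uuid.uuid4().hex
    parts = []
    # Text form fields, matching -F "name=..." and -F "description=...".
    for field, value in (("name", name), ("description", description)):
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{field}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # File field, matching -F "audio=@hello.wav".
    # Assumption: a generic binary content type is acceptable to the server.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))  # closing boundary
    return urllib.request.Request(
        VOICES_ADD_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def parse_voice_response(response_text):
    """Extract the voice_id and creation time from a response like the sample."""
    data = json.loads(response_text)
    return {
        "voice_id": data["voice_id"],
        "name": data["name"],
        # Assumption: created_at is a Unix timestamp in seconds (UTC).
        "created_at": datetime.fromtimestamp(data["created_at"], tz=timezone.utc),
    }


voice = parse_voice_response(SAMPLE_RESPONSE)
print(voice["voice_id"], voice["created_at"].isoformat())
```

To send the request, pass the object returned by build_add_voice_request to urllib.request.urlopen and feed the decoded response body to parse_voice_response. The returned voice_id is the value the capabilities list refers to for voice cloning.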
© 2026 Deep Infra. All rights reserved.