Qwen/Qwen3-TTS

$20.00 / 1M characters

Qwen3-TTS is an advanced text-to-speech model by Alibaba's Qwen team, delivering stable, expressive, and low-latency speech generation across 10 languages.

Key capabilities:

- 9 preset voices: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, and Sohee, covering diverse genders, ages, and accents
- Voice cloning: clone any voice from a short (~3s) audio sample via the voice_id parameter
- Instruction control: adjust tone, emotion, and speaking style with natural language (e.g. "speak slowly and calmly", "excited tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming with ~97ms first-byte latency
- Multiple output formats: WAV, MP3, FLAC, PCM

The model is built on a 1.7B-parameter architecture that uses discrete multi-codebook language modeling for end-to-end speech synthesis, avoiding the cascading errors of multi-stage pipelines. It uses a custom 12Hz acoustic tokenizer that preserves paralinguistic information and environmental audio details.
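As a rough illustration of the listed price of $20.00 per 1M characters, the sketch below estimates the cost of synthesizing a piece of text. The function name `estimate_tts_cost` is hypothetical, and it assumes billing counts characters the way Python's `len()` does; the actual billing rules may differ.

```python
# Listed price from this page: $20.00 per 1M input characters.
PRICE_PER_MILLION_CHARS_USD = 20.00


def estimate_tts_cost(text: str) -> float:
    """Return an estimated cost in USD for synthesizing `text`.

    Assumption: one billed character per Unicode code point (len(text)).
    """
    return len(text) * PRICE_PER_MILLION_CHARS_USD / 1_000_000


sample = "Hello from Qwen3-TTS!"
print(f"{len(sample)} characters -> ${estimate_tts_cost(sample):.6f}")
```

At this rate, a 500-character paragraph comes to about one cent.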


Create Voice HTTP/cURL API

DeepInfra supports custom voices.

Create voice

The following curl command creates a custom voice from a local audio sample (hello.wav):

curl -X POST "https://api.deepinfra.com/v1/voices/add" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F "audio=@hello.wav" \
  -F "name=John Doe" \
  -F "description=John Doe's voice"

The request returns a response similar to:

{
  "user_id": "gh:10000000", 
  "voice_id": "abcd1234abcd1234abcd",
  "name": "John Doe",
  "description": "John Doe's voice",
  "created_at": 1723851387,
  "updated_at": 1723851387
}
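The same request can be sketched in Python using only the standard library. This is a minimal sketch that mirrors the curl call above: the endpoint, form fields, and response shape come from that example, while the helper names (`build_create_voice_request`, `parse_voice_id`, `create_voice`) are hypothetical and the `application/octet-stream` part type is an assumption.

```python
import json
import urllib.request
import uuid

# Endpoint taken from the curl example above.
VOICES_ADD_URL = "https://api.deepinfra.com/v1/voices/add"


def build_create_voice_request(token, audio_bytes, filename, name, description):
    """Build (but do not send) a multipart POST mirroring the curl call."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text form fields: name and description.
    for field, value in (("name", name), ("description", description)):
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{field}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # File field: the audio sample (part content type is an assumption).
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        VOICES_ADD_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def parse_voice_id(response_body):
    """Extract voice_id from a JSON response shaped like the example above."""
    return json.loads(response_body)["voice_id"]


def create_voice(token, audio_path, name, description):
    """Send the request and return the new voice_id (performs a network call)."""
    with open(audio_path, "rb") as f:
        request = build_create_voice_request(
            token, f.read(), audio_path, name, description
        )
    with urllib.request.urlopen(request) as response:
        return parse_voice_id(response.read().decode("utf-8"))
```

Calling `create_voice(token, "hello.wav", "John Doe", "John Doe's voice")` with a valid token should return the `voice_id` string, which is the value to pass as the voice_id parameter when cloning a voice.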
