$20.00 / 1M characters
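As a rough illustration of the per-character price above, the sketch below estimates what a request might cost. It assumes billing is linear in the number of characters as counted by Python's `len`; how the service actually counts characters is not stated on this page, and `estimate_cost_usd` is a hypothetical helper name.

```python
# Listed price: $20.00 per 1,000,000 input characters.
PRICE_PER_MILLION_CHARS_USD = 20.00


def estimate_cost_usd(text: str) -> float:
    """Estimate the cost of synthesizing `text`.

    ASSUMPTION: cost scales linearly with len(text); the service's real
    character-counting rules may differ.
    """
    return len(text) * PRICE_PER_MILLION_CHARS_USD / 1_000_000


if __name__ == "__main__":
    sample = "Hello, this is a short test sentence."
    print(f"{len(sample)} characters -> ${estimate_cost_usd(sample):.6f}")
```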
Qwen3-TTS-VoiceDesign is a voice design variant of Qwen3-TTS by Alibaba's Qwen team. Instead of selecting from preset voices, you describe the voice you want in natural language, and the model generates speech in that voice.

Key capabilities:
- Natural language voice control: describe any voice with free text (e.g. "a deep male voice with a calm, authoritative presence", "a young cheerful female with a warm and friendly tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on the same 1.7B-parameter architecture as Qwen3-TTS, using discrete multi-codebook language modeling and a custom 12Hz acoustic tokenizer for high-quality end-to-end speech synthesis.
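When audio arrives as a raw PCM stream, the chunks carry no header, so a player needs to be told the sample rate and sample format. The sketch below wraps received chunks in a WAV container using only the standard library. The 24 kHz sample rate, mono channel layout, and 16-bit sample width are assumptions for illustration and are not stated on this page; `pcm_chunks_to_wav` is a hypothetical helper.

```python
import io
import wave
from typing import Iterable


def pcm_chunks_to_wav(
    chunks: Iterable[bytes],
    sample_rate: int = 24000,  # ASSUMPTION: not confirmed by the source
    channels: int = 1,         # ASSUMPTION: mono output
    sample_width: int = 2,     # ASSUMPTION: 16-bit signed samples
) -> bytes:
    """Wrap raw PCM chunks in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Each chunk is appended as it arrives; the header is
            # finalized when the writer is closed.
            wav_file.writeframes(chunk)
    return buffer.getvalue()


if __name__ == "__main__":
    # Two chunks of silence stand in for a real stream.
    fake_stream = [b"\x00\x00" * 1200, b"\x00\x00" * 1200]
    wav_bytes = pcm_chunks_to_wav(fake_stream)
    print(f"WAV file size: {len(wav_bytes)} bytes")
```

In a real client, the chunks would come from the streaming response, and they could be fed to an audio device directly rather than buffered into a file.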

Input text
Text to convert to speech
Voice description
Natural language description of the desired voice (e.g. "A young cheerful female with a warm tone")
Settings
ServiceTier
The service tier used for processing the request. When set to 'priority', the request will be processed with higher priority (only applies to models that support it).
Qwen3TtsLanguage
Select the desired language for the speech output. Use Auto for auto-detection.
TtsResponseFormat
Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.
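The form fields above map naturally onto a request payload. The sketch below validates the inputs against the languages and formats listed on this page and assembles a dictionary. Every field name in the payload (`input`, `voice_description`, `language`, `response_format`, `service_tier`) is a hypothetical choice for illustration, not the documented API schema, and `build_tts_request` is a hypothetical helper.

```python
from typing import Optional

# Languages listed in the model description, plus "Auto" for auto-detection.
SUPPORTED_LANGUAGES = {
    "Auto", "English", "Chinese", "Japanese", "Korean", "German",
    "French", "Russian", "Spanish", "Italian", "Portuguese",
}
# Formats listed under TtsResponseFormat.
SUPPORTED_FORMATS = {"mp3", "opus", "flac", "wav", "pcm"}


def build_tts_request(
    text: str,
    voice_description: str,
    language: str = "Auto",
    response_format: str = "wav",
    service_tier: Optional[str] = None,
) -> dict:
    """Assemble a request payload. Field names are HYPOTHETICAL."""
    if not text:
        raise ValueError("text must not be empty")
    if not voice_description:
        raise ValueError("voice_description must not be empty")
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")

    payload = {
        "input": text,
        "voice_description": voice_description,
        "language": language,
        "response_format": response_format,
    }
    # Only include the tier when the caller asks for one, e.g. "priority".
    if service_tier is not None:
        payload["service_tier"] = service_tier
    return payload


if __name__ == "__main__":
    print(build_tts_request(
        "Welcome aboard.",
        "A young cheerful female with a warm tone",
        language="English",
        response_format="mp3",
    ))
```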
license: apache-2.0
pipeline_tag: text-to-speech
library_name: qwen-tts
We release Qwen3-TTS, a series of powerful speech generation models developed by Qwen, offering comprehensive support for voice cloning, voice design, ultra-high-quality human-like speech generation, and natural language-based voice control.
Qwen3-TTS covers 10 major languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian) as well as multiple dialectal voice profiles.
Install the qwen-tts Python package from PyPI:
pip install -U qwen-tts
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

# Load the model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Custom Voice Generation
wavs, sr = model.generate_custom_voice(
    text="其实我真的有发现,我是一个特别善于观察别人情绪的人。",
    language="Chinese",
    speaker="Vivian",
    instruct="用特别愤怒的语气说",
)
sf.write("output.wav", wavs[0], sr)
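The example above uses a preset speaker with the CustomVoice checkpoint. For voice design, the call would presumably take a free-text voice description instead of a speaker name. The sketch below assumes the package exposes a `generate_voice_design` method that accepts `text`, `language`, and `instruct` and returns `(wavs, sr)` in the same shape as `generate_custom_voice`. That method name, its signature, and the checkpoint name in the usage comment are assumptions not confirmed by this page, so check the qwen-tts documentation before relying on them. `design_voice` is a hypothetical helper.

```python
def design_voice(model, text, voice_description, language="English"):
    """Generate speech in a voice described in natural language.

    ASSUMPTION: `model` provides
    generate_voice_design(text=..., language=..., instruct=...)
    returning (wavs, sample_rate), mirroring generate_custom_voice.
    """
    wavs, sr = model.generate_voice_design(
        text=text,
        language=language,
        instruct=voice_description,
    )
    # Return the first waveform in the batch together with its sample rate.
    return wavs[0], sr


# Usage sketch (needs a GPU and the model weights; the checkpoint name
# below is an ASSUMPTION based on the naming of the CustomVoice model):
#
#   model = Qwen3TTSModel.from_pretrained(
#       "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
#       device_map="cuda:0",
#       dtype=torch.bfloat16,
#   )
#   wav, sr = design_voice(
#       model,
#       "Thanks for calling. How can I help?",
#       "A deep male voice with a calm, authoritative presence",
#   )
#   sf.write("designed.wav", wav, sr)
```

Keeping the model as a parameter makes the helper easy to exercise with a stub object before loading the real weights.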
Zero-shot speech generation on the Seed-TTS test set, measured by Word Error Rate (WER, lower is better):
| Model | test-zh | test-en |
|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base | 0.77 | 1.24 |
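For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the transcript recognized from the generated audio, divided by the number of reference words. The sketch below is a minimal illustration of that formula, not the benchmark's evaluation pipeline. It assumes whitespace tokenization and no text normalization; Chinese test sets are commonly scored at the character level instead, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length.

    ASSUMPTION: words are separated by whitespace and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j]; start with the empty reference prefix.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.2%}")
```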
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝:
@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}