
ResembleAI/
$1.00 / 1M characters
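As a rough illustration of the listed rate, the sketch below estimates cost from input length. It assumes every character of the input text is billed, including spaces and bracketed tags; the page does not say how characters are counted, so treat the result as an approximation. The function name `estimate_cost_usd` is hypothetical.

```python
from decimal import Decimal

# Listed price: $1.00 per 1M characters.
PRICE_PER_MILLION_CHARS = Decimal("1.00")


def estimate_cost_usd(text: str) -> Decimal:
    """Estimate the cost in USD for synthesizing `text`.

    Assumption: billing counts every character in the string,
    including whitespace and tags such as [laugh].
    """
    return PRICE_PER_MILLION_CHARS * len(text) / Decimal(1_000_000)


if __name__ == "__main__":
    sample = "Hello there! [chuckle] This is a short test."
    print(f"{len(sample)} characters -> ${estimate_cost_usd(sample)}")
```

`Decimal` is used here to avoid floating-point rounding when summing many small per-request costs.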
Chatterbox is a family of three state-of-the-art, open-source text-to-speech models by Resemble AI. We are excited to introduce Chatterbox-Turbo, our most efficient model yet. Built on a streamlined 350M-parameter architecture, Turbo delivers high-quality speech with less compute and VRAM than our previous models. We have also distilled the speech-token-to-mel decoder, previously a bottleneck, reducing generation from 10 steps to just one while retaining high-fidelity audio output. Paralinguistic tags are now native to the Turbo model, so you can use [cough], [laugh], [chuckle], and more to add distinct realism. Although Turbo was built primarily for low-latency voice agents, it also excels at narration and creative workflows.

If you like the model but need to scale or tune it for higher accuracy, check out our competitively priced TTS service. It delivers reliable performance with latency under 200 ms, making it ideal for production use in agents, applications, or interactive media.
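The paralinguistic tags mentioned above are written inline in the input text. The sketch below shows one way to list the tags in a script before submitting it. The pattern (lowercase letters inside square brackets) and the helper names are assumptions for illustration; only [cough], [laugh], and [chuckle] are named in the description, and since the model supports more, a tag flagged here as unlisted is not necessarily invalid.

```python
import re

# Assumed tag syntax: a lowercase word in square brackets, e.g. [laugh].
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")

# Only the tags explicitly named in the model description.
NAMED_TAGS = {"cough", "laugh", "chuckle"}


def find_tags(text: str) -> list[str]:
    """Return all bracketed tag names in order of appearance."""
    return TAG_PATTERN.findall(text)


def unlisted_tags(text: str) -> list[str]:
    """Return tags not named in the description (worth double-checking)."""
    return [tag for tag in find_tags(text) if tag not in NAMED_TAGS]


if __name__ == "__main__":
    script = "Well [chuckle] that went better than expected. [cough] Excuse me."
    print(find_tags(script))
    print(unlisted_tags(script))
```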

Input text
Text to convert to speech
Voice ID
Voice ID created on Deep Infra. (Default: empty)
Settings
Exaggeration
Exaggeration factor for the speech (Default: 0, 0 ≤ exaggeration ≤ 1)
CFG
CFG factor for the speech (Default: 0, 0 ≤ cfg ≤ 1)
Temperature
Temperature for the speech (Default: 0.8, 0 ≤ temperature ≤ 2)
Seed
Seed for the random number generator (Default: empty, 0 ≤ seed ≤ 2147483647)
Top P
Top P for the speech (Default: 0.95, 0 ≤ top_p ≤ 1)
Min P
Min P for the speech (Default: 0, 0 ≤ min_p ≤ 1)
Repetition Penalty
Repetition Penalty for the speech (Default: 1.2, 0 ≤ repetition_penalty ≤ 5)
Top K
Top K for the speech (Default: 1000, 0 ≤ top_k ≤ 1000)
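The settings above can be collected into a request body and checked against their documented ranges before sending. The following is a minimal sketch, not the service's client: the setting names, defaults, and ranges come from the form above, while the `text` and `voice_id` key names and the `build_payload` helper are assumptions. The endpoint and authentication are not shown on this page, so the sketch stops at building the payload.

```python
# Documented ranges (inclusive) for each setting.
RANGES = {
    "exaggeration": (0, 1),
    "cfg": (0, 1),
    "temperature": (0, 2),
    "seed": (0, 2147483647),
    "top_p": (0, 1),
    "min_p": (0, 1),
    "repetition_penalty": (0, 5),
    "top_k": (0, 1000),
}

# Documented defaults; seed defaults to empty, so it is omitted here.
DEFAULTS = {
    "exaggeration": 0,
    "cfg": 0,
    "temperature": 0.8,
    "top_p": 0.95,
    "min_p": 0,
    "repetition_penalty": 1.2,
    "top_k": 1000,
}


def build_payload(text, voice_id=None, **settings):
    """Merge settings over the defaults and validate each against its range.

    The "text" and "voice_id" keys are assumed names, not confirmed by the page.
    """
    if not text:
        raise ValueError("text must be a non-empty string")
    unknown = set(settings) - set(RANGES)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")

    merged = {**DEFAULTS, **settings}
    for name, value in merged.items():
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")

    payload = {"text": text, **merged}
    if voice_id:  # default is empty, so only include it when provided
        payload["voice_id"] = voice_id
    return payload


if __name__ == "__main__":
    print(build_payload("Hello [laugh] world", temperature=1.0, seed=42))
```

Validating locally gives a clear error message for an out-of-range value before any request is made.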
© 2025 Deep Infra. All rights reserved.