Browse DeepInfra models:

All categories and models you can try out and use directly on DeepInfra:

Category/text-to-speech

hexgrad/Kokoro-82M
featured
$0.80 per M characters
  • text-to-speech

Kokoro is a frontier TTS model for its size: 82 million parameters, text in and audio out. On 25 Dec 2024, the Kokoro v0.19 weights were released in full fp32 precision under the permissive Apache 2.0 license. As of 2 Jan 2025, 10 unique Voicepacks have been released, and a .onnx version of v0.19 is available.

Zyphra/Zonos-v0.1-hybrid
$7.00 per M characters
  • text-to-speech

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with, or even surpassing, top TTS providers. The model generates highly natural speech from text prompts when given a speaker embedding or audio prefix, and can accurately clone a voice from a reference clip only a few seconds long. The conditioning setup also allows fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44 kHz.

Zyphra/Zonos-v0.1-transformer
$7.00 per M characters
  • text-to-speech

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with, or even surpassing, top TTS providers. The model generates highly natural speech from text prompts when given a speaker embedding or audio prefix, and can accurately clone a voice from a reference clip only a few seconds long. The conditioning setup also allows fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44 kHz.

deepinfra/tts
$5.00 per M characters
  • text-to-speech

Text-to-Speech (TTS) technology converts written text into spoken words using advanced speech synthesis. TTS systems are used in applications like virtual assistants, accessibility tools for visually impaired users, and language learning software, enabling seamless human-computer interaction.
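
Because every model in this category is priced per million characters, a rough cost estimate is simple arithmetic. The sketch below uses the prices listed above; the helper name `estimate_cost` is hypothetical, and it assumes billing counts characters the way Python's `len()` does, which the listing does not confirm.

```python
from decimal import Decimal

# Prices in USD per million characters, copied from the listing above.
PRICE_PER_M_CHARS = {
    "hexgrad/Kokoro-82M": Decimal("0.80"),
    "Zyphra/Zonos-v0.1-hybrid": Decimal("7.00"),
    "Zyphra/Zonos-v0.1-transformer": Decimal("7.00"),
    "deepinfra/tts": Decimal("5.00"),
}


def estimate_cost(model: str, text: str) -> Decimal:
    """Hypothetical helper: estimated USD cost of synthesizing `text`.

    Assumption: characters are counted as len(text); actual billing
    may count them differently.
    """
    return PRICE_PER_M_CHARS[model] * len(text) / Decimal(1_000_000)


if __name__ == "__main__":
    sample = "Hello from a text-to-speech model. " * 1000
    for model in PRICE_PER_M_CHARS:
        print(f"{model}: ${estimate_cost(model, sample)}")
```

Decimal arithmetic is used instead of floats so that small per-request amounts do not pick up binary rounding noise when they are summed.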