hexgrad/Kokoro-82M

$0.62 / 1M characters

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
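
As a rough illustration of the per-character price listed above, the sketch below estimates the cost of synthesizing a given amount of text. It assumes billing is strictly proportional to character count, with no minimum charge or rounding, which this page does not state. `estimate_cost_usd` is a hypothetical helper, not part of any API.

```python
from decimal import Decimal

# Price taken from the listing above: $0.62 per 1M characters.
PRICE_PER_MILLION_CHARS = Decimal("0.62")


def estimate_cost_usd(num_chars: int) -> Decimal:
    """Estimate cost in USD, assuming strictly linear per-character billing."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return PRICE_PER_MILLION_CHARS * Decimal(num_chars) / Decimal(1_000_000)


if __name__ == "__main__":
    # Example: estimate the cost of a 300,000-character manuscript.
    print(estimate_cost_usd(300_000))
```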


Input

Input text

Text to convert to speech

Settings

Response Format (TtsResponseFormat)

Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.

Voice (selected: af_bella)

Select the desired voice for the speech output. You can select multiple to combine and mix voices.

Speed

Speed of the speech (Default: empty, 0.25 ≤ speed ≤ 4)

Stream

Whether to stream the output

Return Timestamps

Whether to return timestamps

Sample Rate

Sample rate for the output audio. (Default: empty)

Target Min Tokens

Minimum number of tokens for the output. (Default: empty)

Target Max Tokens

Maximum number of tokens for the output. (Default: empty)

Absolute Max Tokens

Absolute maximum number of tokens for the output. (Default: empty)
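
The settings above map naturally onto a request payload. The sketch below builds and validates such a payload using only the constraints stated on this page (the five output formats and the 0.25 to 4 speed range). The field names (`input`, `voice`, `response_format`, `speed`, `stream`), the `mp3` default, and the helper `build_tts_payload` are assumptions for illustration; the provider's API reference is the authority on the actual schema.

```python
from typing import Any, Dict, Optional

# Formats and speed bounds as listed in the settings above.
SUPPORTED_FORMATS = {"mp3", "opus", "flac", "wav", "pcm"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def build_tts_payload(
    text: str,
    voice: str = "af_bella",
    response_format: str = "mp3",  # assumed default, not stated on the page
    speed: Optional[float] = None,
    stream: bool = False,
) -> Dict[str, Any]:
    """Build a request body; field names are hypothetical, not a confirmed schema."""
    if not text:
        raise ValueError("text must be non-empty")
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")
    payload: Dict[str, Any] = {
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "stream": stream,
    }
    # Speed defaults to empty on the page, so it is only sent when set.
    if speed is not None:
        if not MIN_SPEED <= speed <= MAX_SPEED:
            raise ValueError(f"speed must be within [{MIN_SPEED}, {MAX_SPEED}]")
        payload["speed"] = speed
    return payload
```

Validating on the client side keeps obviously invalid requests (for example a speed of 10) from being sent at all.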


Model Information

license: apache-2.0
language:
  • en
base_model:
  • yl4579/StyleTTS2-LJSpeech
pipeline_tag: text-to-speech

🐈 GitHub: https://github.com/hexgrad/kokoro

🚀 Demo: https://hf.co/spaces/hexgrad/Kokoro-TTS


Releases

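Model | Published   | Training Data   | Langs & Voices | SHA256
------|-------------|-----------------|----------------|---------
v1.0  | 2025 Jan 27 | Few hundred hrs | 8 & 54         | 496dba11
v0.19 | 2024 Dec 25 | <100 hrs        | 1 & 10         | 3b0c392f

Training Costs         | v0.19   | v1.0    | Total
-----------------------|---------|---------|------
in A100 80GB GPU hours | 500     | 500     | 1000
average hourly rate    | $0.80/h | $1.20/h | $1/h
in USD                 | $400    | $600    | $1000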
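
The totals in the training cost figures above follow from the per-version numbers, and the short check below reproduces them. All values are taken from those figures; the variable names are illustrative only.

```python
# GPU hours and average hourly rates (in US cents) per release, from the figures above.
runs = {
    "v0.19": {"gpu_hours": 500, "rate_cents": 80},
    "v1.0": {"gpu_hours": 500, "rate_cents": 120},
}

# Work in integer cents to avoid floating-point rounding.
costs_usd = {name: r["gpu_hours"] * r["rate_cents"] // 100 for name, r in runs.items()}
total_hours = sum(r["gpu_hours"] for r in runs.values())
total_usd = sum(costs_usd.values())
average_rate_usd = total_usd / total_hours  # blended hourly rate across both runs

print(costs_usd, total_hours, total_usd, average_rate_usd)
```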

Usage

You can run this basic cell on Google Colab. Listen to samples. For more languages and details, see Advanced Usage.

# Install dependencies. The version spec is quoted so the shell does not treat ">" as a redirect.
!pip install -q "kokoro>=0.9.2" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1

from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch

# lang_code='a' selects American English
pipeline = KPipeline(lang_code='a')

# [Kokoro](/kˈOkəɹO/) supplies an explicit phoneme pronunciation for the word
text = '''
[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
generator = pipeline(text, voice='af_heart')

# Each item carries the graphemes (gs), phonemes (ps), and audio of one segment
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)
    display(Audio(data=audio, rate=24000, autoplay=i==0))  # autoplay only the first segment
    sf.write(f'{i}.wav', audio, 24000)  # one 24 kHz WAV file per segment
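
The loop above writes one WAV file per generated segment. To join segments into a single file, something like the following could work. It is a sketch using only the standard library, and it assumes each segment can be iterated as floats in the range [-1, 1] at the 24000 Hz rate used above. `write_wav_int16` is a hypothetical helper, not part of kokoro.

```python
import struct
import wave
from typing import Iterable, Sequence


def write_wav_int16(
    path: str, segments: Iterable[Sequence[float]], sample_rate: int = 24000
) -> int:
    """Concatenate float segments into one mono 16-bit PCM WAV; return frames written."""
    frames_written = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        for segment in segments:
            # Clamp to [-1, 1] (an assumed input range), then scale to int16.
            ints = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in segment]
            wav.writeframes(struct.pack(f"<{len(ints)}h", *ints))
            frames_written += len(ints)
    return frames_written
```

With the pipeline from the cell above, this might be called as `write_wav_int16('out.wav', (audio.tolist() for _, _, audio in generator))`, assuming `audio` is a tensor or array that exposes `tolist()`.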

Under the hood, kokoro uses misaki, a G2P library at https://github.com/hexgrad/misaki

Model Facts

Architecture:

Architected by: Li et al @ https://github.com/yl4579/StyleTTS2

Trained by: @rzvzn on Discord

Languages: Multiple

Model SHA256 Hash: 496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4
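
To check a downloaded weights file against the hash above, a sketch like the following could be used. The path in the usage comment is a placeholder, since this page does not name the checkpoint file, and `sha256_of_file` and `matches` are hypothetical helpers.

```python
import hashlib

# Full hash for v1.0 as listed above; the release listing shows its first 8 characters.
EXPECTED_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare case-insensitively; `expected` may be a full hash or a short prefix."""
    return sha256_of_file(path).startswith(expected.lower())


# Usage (placeholder path): matches("path/to/downloaded-weights-file")
```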

Training Details

Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:

  • Public domain audio
  • Audio licensed under Apache, MIT, etc
  • Synthetic audio[1] generated by closed[2] TTS models from large providers
    [1] https://copyright.gov/ai/ai_policy_guidance.pdf
    [2] No synthetic audio from open TTS models or "custom voice clones"

Total Dataset Size: A few hundred hours of audio

Total Training Cost: About $1000 for 1000 GPU hours on A100 80GB

Creative Commons Attribution

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

Audio Data | Duration Used | License   | Added to Training Set After
-----------|---------------|-----------|----------------------------
Koniwa tnc | <1h           | CC BY 3.0 | v0.19 / 22 Nov 2024
SIWIS      | <11h          | CC BY 4.0 | v0.19 / 22 Nov 2024

Acknowledgements

  • 🛠️ @yl4579 for architecting StyleTTS 2.
  • 🏆 @Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena.
  • 📊 Thank you to everyone who contributed synthetic training data.
  • ❤️ Special thanks to all compute sponsors.
  • 👾 Discord server: https://discord.gg/QuGxSWBfQy
  • 🪽 Kokoro is a Japanese word that translates to "heart" or "spirit". Kokoro is also the name of an AI in the Terminator franchise.