Dia directly generates highly realistic dialogue from a transcript. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communications like laughter, coughing, clearing throat, etc.
The interactive demo exposes the following inputs and generation parameters:

| Input / Parameter | Description | Default | Range |
| --- | --- | --- | --- |
| Text | Text to convert to speech. | — | — |
| Audio Prompt | Optional audio file used to condition the output (emotion and tone control). | — | — |
| Audio Prompt Transcript | Transcript of the given speaker audio. If not provided, the speaker audio is used as-is. | empty | — |
| Max New Tokens | Maximum length of the generated audio (more tokens = longer audio). | 3072 | 500–4096 |
| CFG Scale | Higher values increase adherence to the text prompt. | 3 | 1–5 |
| Temperature | Lower values make the output more deterministic; higher values increase randomness. | 1.3 | 1–1.5 |
| Top P | Restricts sampling to the most likely tokens whose cumulative probability reaches P. | 0.95 | 0.8–1 |
| CFG Filter Top K | Top-k filter applied during CFG guidance. | 35 | 15–50 |
| Speed | Speed of the generated audio (1.0 = original speed). | 0.94 | 0.8–1 |
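These demo controls mirror sampling options of the Python API. Below is a minimal sketch of passing them explicitly; the keyword names are assumptions inferred from the demo labels, so verify them against the actual `generate()` signature in the repository:

```python
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Keyword names below are assumptions inferred from the demo labels;
# check the Dia repository for the exact generate() signature.
output = model.generate(
    "[S1] Hello there. [S2] Hi! (laughs)",
    max_tokens=3072,       # maximum audio length in tokens
    cfg_scale=3.0,         # adherence to the text prompt
    temperature=1.3,       # sampling randomness
    top_p=0.95,            # nucleus sampling cutoff
    cfg_filter_top_k=35,   # top-k filter during CFG guidance
)
```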
Dia is a 1.6B parameter text-to-speech model created by Nari Labs. It was pushed to the Hub using the `PyTorchModelHubMixin` integration.
To accelerate research, we are providing access to pretrained model checkpoints and inference code. The model weights are hosted on Hugging Face. The model only supports English generation at the moment.
We also provide a demo page comparing our model to ElevenLabs Studio and Sesame CSM-1B.
Clone the repository and launch the demo:

```bash
git clone https://github.com/nari-labs/dia.git
cd dia && uv run app.py
```

Or, if you do not have uv pre-installed:

```bash
git clone https://github.com/nari-labs/dia.git
cd dia
python -m venv .venv
source .venv/bin/activate
pip install uv
uv run app.py
```

This will open a Gradio UI that you can work on.
Note that the model was not fine-tuned on a specific voice, so you will get a different voice every time you run the model. You can keep speaker consistency either by adding an audio prompt (a guide is coming very soon; for now, try the second example in the Gradio demo) or by fixing the seed, as sketched below.
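For example, a minimal sketch of fixing the seed. This assumes generation draws from PyTorch's (and possibly Python's and NumPy's) global RNGs; Dia does not document a dedicated seed argument:

```python
import random

import numpy as np
import torch

from dia.model import Dia

# Assumption: seeding the global RNGs makes sampling repeatable,
# so repeated runs should produce the same voice.
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

model = Dia.from_pretrained("nari-labs/Dia-1.6B")
output = model.generate("[S1] Same seed, same voice.")
```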
Mark speaker turns with `[S1]` and `[S2]` tags, and add non-verbal sounds inline with tags like (laughs), (coughs), etc. The following non-verbal tags are recognized:

(laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
To clone a voice, provide an audio prompt together with its transcript; see `example/voice_clone.py` for more information.
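A minimal sketch of the idea, assuming `generate()` accepts an `audio_prompt_path` argument and that the prompt's transcript is prepended to the new script. Treat both as assumptions and defer to `example/voice_clone.py` for the exact API:

```python
import soundfile as sf

from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Transcript of the reference clip, followed by the new lines to generate.
clone_from_text = "[S1] This is the transcript of the reference clip. "
text = "[S1] And this is a new line spoken in the cloned voice."

# `audio_prompt_path` is an assumed argument name; see example/voice_clone.py.
output = model.generate(clone_from_text + text, audio_prompt_path="reference.mp3")
sf.write("cloned.mp3", output, 44100)
```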
```python
import soundfile as sf

from dia.model import Dia

# Download the weights from the Hub and build the model.
model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] mark speaker turns; parenthesized tags insert non-verbal sounds.
text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."

# Generate a waveform and save it at the model's 44.1 kHz sample rate.
output = model.generate(text)
sf.write("simple.mp3", output, 44100)
```
A PyPI package and a working CLI tool will be available soon.
Dia has only been tested on GPUs (PyTorch 2.0+, CUDA 12.6). CPU support will be added soon. The initial run will take longer because the Descript Audio Codec also needs to be downloaded.
On enterprise GPUs, Dia can generate audio in real time. On older GPUs, inference will be slower.
For reference, on an A4000 GPU, Dia generates roughly 40 tokens/s, where 86 tokens correspond to one second of audio, so this is about 0.5× real time.
Using `torch.compile` will increase speeds for supported GPUs.
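A minimal sketch, assuming the underlying `torch.nn.Module` is reachable on the wrapper; the `.model` attribute name is an assumption, so inspect the `Dia` class for where the module actually lives:

```python
import torch

from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Hypothetical attribute: compile the inner nn.Module before generating.
model.model = torch.compile(model.model)

output = model.generate("[S1] Compiled inference should be faster.")
```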
The full version of Dia requires around 10GB of VRAM to run. We will be adding a quantized version in the future.
If you don't have hardware available or if you want to play with bigger versions of our models, join the waitlist here.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are strictly forbidden:

- Identity misuse: do not produce audio resembling real individuals without permission.
- Deceptive content: do not use this model to generate misleading content (e.g., fake news).
- Illegal or malicious use: do not use this model for activities that are illegal or intended to cause harm.
By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We are not responsible for any misuse and firmly oppose any unethical usage of this technology.
We are a tiny team with one full-time and one part-time research engineer. Contributions are extra welcome! Join our Discord server for discussions.