CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
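In Sesame's description, generation is two-stage: the backbone predicts the zeroth (semantic) Mimi codebook for each audio frame, and the smaller decoder fills in the remaining residual codebooks. The toy sketch below illustrates only that two-stage shape; the sizes, nn.Linear stand-ins, and greedy argmax decoding are placeholders, not the repository's implementation.

import torch
import torch.nn as nn

NUM_CODEBOOKS, CODEBOOK_SIZE, DIM = 8, 2048, 512   # illustrative sizes only

# Toy stand-ins: real CSM uses Llama-style transformers, not linear layers.
backbone = nn.Linear(DIM, DIM)                     # predicts codebook 0
decoder = nn.Linear(DIM + 1, DIM)                  # predicts codebooks 1..N-1
heads = nn.ModuleList(nn.Linear(DIM, CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS))

def generate_frame(cond: torch.Tensor) -> torch.Tensor:
    # Stage 1: the backbone produces a state and the zeroth (semantic) code.
    h = backbone(cond)
    codes = [heads[0](h).argmax(-1)]
    # Stage 2: the decoder fills in the residual (acoustic) codebooks in order.
    for k in range(1, NUM_CODEBOOKS):
        h = decoder(torch.cat([h, codes[-1].float().unsqueeze(-1)], dim=-1))
        codes.append(heads[k](h).argmax(-1))
    return torch.stack(codes)                      # one frame of RVQ codes

frame = generate_frame(torch.randn(DIM))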
Hosted demo parameters:
- Text to convert to speech.
- Desired format for the speech output; supported formats are mp3, opus, flac, wav, and pcm.
- Desired voice for the speech output.
- Temperature of the generation (default: 0.9).
- Speaker audio file (upload).
- Transcript of the given speaker audio; if not provided, the speaker audio is used as is (default: empty).
- Maximum audio length in milliseconds (default: 10000).
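For illustration only, these fields map naturally onto a request payload like the following; the field names are hypothetical, since the hosted demo's actual API is not documented here.

# Hypothetical payload for a hosted CSM endpoint; all field names are
# illustrative, not a documented API -- check the actual service docs.
payload = {
    "text": "Hello from Sesame.",       # text to convert to speech
    "response_format": "wav",           # mp3, opus, flac, wav, or pcm
    "voice": "0",                       # desired voice for the output
    "temperature": 0.9,                 # default 0.9
    "speaker_audio": "utterance_0.wav", # uploaded speaker audio file
    "transcript": "",                   # optional transcript (default empty)
    "max_audio_length_ms": 10000,       # default 10000
}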
2025/03/13 - We are releasing the 1B CSM variant. The checkpoint is hosted on Hugging Face.
A fine-tuned variant of CSM powers the interactive voice demo shown in our blog post.
A hosted Hugging Face space is also available for testing audio generation.
ffmpeg may be required for some audio operations.

git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# You will need access to CSM-1B and Llama-3.2-1B
huggingface-cli login
The triton package cannot be installed on Windows. Instead, use pip install triton-windows.
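Before running the examples, a quick sanity check that the environment is picked up as expected (this assumes only torch and torchaudio from requirements.txt):

# Sanity-check the environment the examples below will use.
import sys
import torch
import torchaudio

print(sys.version.split()[0])                        # expect 3.10.x
print(torch.__version__, torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())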
Generate a sentence
from generator import load_csm_1b
import torchaudio
import torch

# Pick the best available device: Apple Silicon (MPS), CUDA, or CPU.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

generator = load_csm_1b(device=device)

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],                  # no conversational context for a single line
    max_audio_length_ms=10_000,
)

# audio is a mono waveform tensor; add a channel dimension before saving.
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
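generate() returns the waveform at generator.sample_rate (Mimi operates on 24 kHz audio), which is why the save call adds a channel dimension with unsqueeze(0). A quick way to check the output length against max_audio_length_ms:

# Sanity-check the generated duration against max_audio_length_ms.
duration_s = audio.shape[-1] / generator.sample_rate
print(f"Generated {duration_s:.2f}s at {generator.sample_rate} Hz")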
CSM sounds best when provided with context. You can prompt or provide context to the model using a Segment for each speaker's utterance.
from generator import Segment

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load a reference utterance and resample it to the generator's rate.
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

# One Segment per prior utterance: its transcript, speaker id, and audio.
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]

audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
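Since the generated waveform is already at generator.sample_rate, one convenient pattern (a sketch building only on the objects above) is to append each generated utterance back into the context so a conversation can roll forward turn by turn:

# Feed the utterance just generated back in as context for the next turn.
segments.append(
    Segment(text="Me too, this is some cool stuff huh?", speaker=1, audio=audio)
)
next_audio = generator.generate(
    text="Yeah, it really does sound more natural with context.",
    speaker=0,
    context=segments,
    max_audio_length_ms=10_000,
)
torchaudio.save("audio_next.wav", next_audio.unsqueeze(0).cpu(), generator.sample_rate)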
Does this model come with any voices?
The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.
Can I converse with the model?
CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
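A minimal sketch of that division of labor, where llm_reply() is a hypothetical placeholder for whatever text model you pair with CSM:

# Sketch: a text LLM writes the reply, CSM speaks it.
def llm_reply(user_text: str) -> str:
    # Hypothetical placeholder -- swap in any text-generation model here.
    return "Happy to help! Here's a quick overview."

reply_audio = generator.generate(
    text=llm_reply("Can you explain what CSM does?"),
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)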
Does it support other languages?
The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.
This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:
- Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.
By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.
Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.