CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
You can use cURL or any other HTTP client to run inference:
curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN" \
    -F 'text=The quick brown fox jumps over the lazy dog' \
    'https://api.deepinfra.com/v1/inference/sesame/csm-1b'
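The same request can be made from Python. A minimal sketch using only the standard library, with the token read from the `DEEPINFRA_TOKEN` environment variable as in the cURL example (the fields are sent as URL-encoded form data rather than multipart, which is an assumption about what the endpoint accepts for plain text fields):

```python
import json
import os
import urllib.parse
import urllib.request

API_URL = "https://api.deepinfra.com/v1/inference/sesame/csm-1b"

# The form field mirrors the -F flag of the cURL example.
fields = {"text": "The quick brown fox jumps over the lazy dog"}
body = urllib.parse.urlencode(fields).encode()

token = os.environ.get("DEEPINFRA_TOKEN")
if token:
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Authorization": f"bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp))
```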
which will give you back something similar to:
{
  "audio": null,
  "input_character_length": 0,
  "output_format": "",
  "words": [
    {
      "text": "Hello",
      "start": 0.0,
      "end": 1.0,
      "confidence": 0.5
    },
    {
      "text": "World",
      "start": 4.0,
      "end": 5.0,
      "confidence": 0.5
    }
  ],
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}
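The response can be inspected programmatically. A small sketch, using the example payload above (abbreviated to the fields it touches), that pulls out the per-word timings:

```python
import json

# The example response from above, reduced to the "words" field.
response_text = """
{
  "words": [
    {"text": "Hello", "start": 0.0, "end": 1.0, "confidence": 0.5},
    {"text": "World", "start": 4.0, "end": 5.0, "confidence": 0.5}
  ]
}
"""

data = json.loads(response_text)

# Each entry carries the word plus start/end timestamps in seconds.
timings = [(w["text"], w["start"], w["end"]) for w in data["words"]]
print(timings)  # [('Hello', 0.0, 1.0), ('World', 4.0, 5.0)]
```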
response_format
string
Output format for the speech.
Default value: "wav"
Allowed values: mp3, opus, flac, wav, pcm
preset_voice
string
Preset voice name to use for the speech.
Default value: "none"
Allowed values: conversational_a, conversational_b, read_speech_a, read_speech_b, read_speech_c, read_speech_d, none
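Both optional parameters are passed as additional form fields alongside `text`. A hedged sketch of the request payload: the parameter names come from the tables above, while the specific value choices are illustrative:

```python
import urllib.parse

# Optional parameters ride along as extra form fields next to `text`.
fields = {
    "text": "The quick brown fox jumps over the lazy dog",
    "response_format": "mp3",            # one of: mp3, opus, flac, wav, pcm
    "preset_voice": "conversational_a",  # one of the preset voice names
}

body = urllib.parse.urlencode(fields)
print(body)
```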
speaker_transcript
string
Transcript of the given speaker audio. If not provided, the speaker audio will be used as-is.
webhook
file
The webhook to call when inference is done. By default, you will get the output in the response of your inference request.
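When cloning a voice from reference audio, the transcript is sent as another form field. A sketch: the `speaker_transcript` name comes from the parameter list above, and the transcript text is illustrative:

```python
import urllib.parse

# The transcript is optional: when omitted, the endpoint uses the
# speaker audio as-is (see the parameter description above).
fields = {
    "text": "The quick brown fox jumps over the lazy dog",
    "speaker_transcript": "An example transcript of the reference audio clip.",
}

body = urllib.parse.urlencode(fields)
print(body)
```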