
sesame/csm-1b

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.

Public
$7.00 per M characters
Project | Paper | License

HTTP/cURL API

You can use cURL or any other HTTP client to run inferences:

curl -X POST \
    -d '{"text": "The quick brown fox jumps over the lazy dog"}'  \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -H 'Content-Type: application/json'  \
    'https://api.deepinfra.com/v1/inference/sesame/csm-1b'

which will give you back something similar to:

{
  "audio": null,
  "input_character_length": 0,
  "output_format": "",
  "words": [
    {
      "end": 1.0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "start": 4.0,
      "text": "World"
    }
  ],
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}

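The same request can be issued from Python using only the standard library. The sketch below builds the JSON body and posts it with the bearer token from the environment; it returns the raw parsed response, since the exact contents of the `audio` field depend on the output format you request. The helper names here (`build_payload`, `synthesize`) are illustrative, not part of the API.

```python
import json
import os
import urllib.request

API_URL = "https://api.deepinfra.com/v1/inference/sesame/csm-1b"

def build_payload(text, response_format="wav", max_audio_length_ms=10000):
    # Assemble the JSON body; field names match the input fields documented below.
    return {
        "text": text,
        "response_format": response_format,
        "max_audio_length_ms": max_audio_length_ms,
    }

def synthesize(text, token=None):
    # POST the payload and return the parsed JSON response as-is.
    token = token or os.environ["DEEPINFRA_TOKEN"]
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_payload("The quick brown fox jumps over the lazy dog")
```

Calling `synthesize(...)` requires a valid `DEEPINFRA_TOKEN`; `build_payload` can be inspected without making a request.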

Input fields

text (string)

Text to convert to speech


response_format (string)

Output format for the speech

Default value: "wav"

Allowed values: mp3, opus, flac, wav, pcm


preset_voice (string)

Preset voice name to use for the speech

Default value: "none"

Allowed values: conversational_a, conversational_b, read_speech_a, read_speech_b, read_speech_c, read_speech_d, none


temperature (number)

Temperature of the generation

Default value: 0.9


speaker_audio (string)

Speaker audio for the speech to be synthesized


speaker_transcript (string)

Transcript of the given speaker audio. If not provided, the speaker audio will be used as is.


max_audio_length_ms (integer)

Maximum audio length in milliseconds

Default value: 10000


stream (boolean)

Whether to stream audio bytes in chunks

Default value: false


webhook (file)

The webhook to call when inference is done. By default you will get the output in the response of your inference request.
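To condition generation on a reference speaker, the request body combines `text` with `speaker_audio` and, optionally, `speaker_transcript`. The fields above only describe `speaker_audio` as a string, so this sketch assumes it is sent as base64-encoded audio file contents; check the input schema before relying on that. The helper name `clone_voice_payload` is hypothetical.

```python
import base64

def clone_voice_payload(text, audio_path, transcript=None, temperature=0.9):
    # Build a request body that conditions generation on a reference speaker.
    # Assumption: speaker_audio is base64-encoded file contents (the docs
    # above only say "string"), so verify against the input schema.
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    body = {
        "text": text,
        "speaker_audio": audio_b64,
        "temperature": temperature,
    }
    if transcript is not None:
        # Transcript of the reference audio; omitted if not supplied,
        # in which case the speaker audio is used as is.
        body["speaker_transcript"] = transcript
    return body
```

The resulting dictionary can be serialized with `json.dumps` and posted to the same endpoint as the cURL example above.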

