sesame/csm-1b

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.

Public
$10.00 per M characters
Project | Paper | License

HTTP/cURL API

You can use cURL or any other HTTP client to run inferences:

curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -F 'text=The quick brown fox jumps over the lazy dog'  \
    'https://api.deepinfra.com/v1/inference/sesame/csm-1b'

which will give you back something similar to:

{
  "audio": null,
  "input_character_length": 0,
  "output_format": "",
  "words": [
    {
      "text": "Hello",
      "start": 0.0,
      "end": 1.0,
      "confidence": 0.5
    },
    {
      "text": "World",
      "start": 4.0,
      "end": 5.0,
      "confidence": 0.5
    }
  ],
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}
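
You can also call the endpoint from Python. The sketch below uses the requests library and passes optional fields from the "Input fields" section; it assumes the audio field of a successful response is a base64-encoded data URI, which you should verify against the actual payload you receive:

import base64
import os

import requests

API_URL = "https://api.deepinfra.com/v1/inference/sesame/csm-1b"
TOKEN = os.environ["DEEPINFRA_TOKEN"]

# Form fields mirror the cURL example above; response_format and
# preset_voice are optional (see "Input fields" below).
resp = requests.post(
    API_URL,
    headers={"Authorization": f"bearer {TOKEN}"},
    data={
        "text": "The quick brown fox jumps over the lazy dog",
        "response_format": "wav",
        "preset_voice": "conversational_a",
    },
    timeout=120,
)
resp.raise_for_status()
result = resp.json()

# Assumption: audio arrives as "data:audio/wav;base64,<payload>".
# Strip the prefix before decoding.
audio = result["audio"]
with open("speech.wav", "wb") as f:
    f.write(base64.b64decode(audio.split(",", 1)[-1]))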

Input fields

text (string)

Text to convert to speech


response_format (string)

Output format for the speech

Default value: "wav"

Allowed values: mp3, opus, flac, wav, pcm


preset_voice (string)

Preset voice name to use for the speech

Default value: "none"

Allowed values: conversational_a, conversational_b, read_speech_a, read_speech_b, read_speech_c, read_speech_d, none


temperature (number)

Sampling temperature for generation

Default value: 0.9


speaker_audio (string)

Reference speaker audio whose voice is used for the synthesized speech


speaker_transcript (string)

Transcript of the given speaker audio. If not provided, the speaker audio will be used as is. (See the voice-cloning sketch after this list.)


max_audio_length_ms (integer)

Maximum audio length in milliseconds

Default value: 10000


webhook (file)

The webhook to call when inference is done. By default, you will get the output in the response of your inference request.
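
To clone a voice instead of using a preset, pass a reference clip via speaker_audio together with its speaker_transcript. A minimal sketch, again in Python; the reference file name and transcript are placeholders, and since speaker_audio is declared as a string, the sketch assumes it accepts a base64 data URI — verify the exact expected encoding against the model's input schema:

import base64
import os

import requests

API_URL = "https://api.deepinfra.com/v1/inference/sesame/csm-1b"
TOKEN = os.environ["DEEPINFRA_TOKEN"]

# Placeholder reference clip; a few seconds of clean speech.
# Assumption: the string-typed speaker_audio field accepts a
# base64 data URI.
with open("reference_clip.wav", "rb") as f:
    clip_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    API_URL,
    headers={"Authorization": f"bearer {TOKEN}"},
    data={
        "text": "This sentence should come out in the reference voice.",
        "speaker_audio": f"data:audio/wav;base64,{clip_b64}",
        # Placeholder transcript of reference_clip.wav.
        "speaker_transcript": "Transcript of the reference clip.",
        "max_audio_length_ms": "15000",
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["inference_status"])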
