Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
You can use cURL or any other http client to run inferences:
curl -X POST \
-H "Authorization: bearer $DEEPINFRA_TOKEN" \
-F audio=@my_voice.mp3 \
'https://api.deepinfra.com/v1/inference/openai/whisper-large-v3'
which will give you back something similar to:
{
"text": "",
"segments": [
{
"id": 0,
"text": "Hello",
"start": 0.0,
"end": 1.0
},
{
"id": 1,
"text": "World",
"start": 4.0,
"end": 5.0
}
],
"language": "en",
"input_length_ms": 0,
"request_id": null,
"inference_status": {
"status": "unknown",
"runtime_ms": 0,
"cost": 0.0,
"tokens_generated": 0,
"tokens_input": 0
}
}
language
stringlanguage that the audio is in; uses detected language if None; use two letter language code (ISO 639-1) (e.g. en, de, ja)
chunk_level
stringchunk level, either 'segment' or 'word'
Default value: "segment"
Allowed values: segment
word
chunk_length_s
integerchunk length in seconds to split audio
Default value: 30
Range: 1 ≤ chunk_length_s ≤ 30
webhook
fileThe webhook to call when inference is done, by default you will get the output in the response of your inference request