Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrates a strong ability to generalize to many datasets and domains without the need for fine-tuning. The model is based on a Transformer architecture and uses a large-scale weak supervision technique.
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrates a strong ability to generalize to many datasets and domains without the need for fine-tuning. The model is based on a Transformer architecture and uses a large-scale weak supervision technique.
You can use cURL or any other http client to run inferences:
curl -X POST \
-H "Authorization: bearer $DEEPINFRA_TOKEN" \
-F audio=@my_voice.mp3 \
'https://api.deepinfra.com/v1/inference/openai/whisper-small'
which will give you back something similar to:
{
"text": "",
"segments": [
{
"id": 0,
"text": "Hello",
"start": 0.0,
"end": 1.0
},
{
"id": 1,
"text": "World",
"start": 4.0,
"end": 5.0
}
],
"language": "en",
"input_length_ms": 0,
"request_id": null,
"inference_status": {
"status": "unknown",
"runtime_ms": 0,
"cost": 0.0,
"tokens_generated": 0,
"tokens_input": 0
}
}
language
stringlanguage that the audio is in; uses detected language if None; use two letter language code (ISO 639-1) (e.g. en, de, ja)
webhook
fileThe webhook to call when inference is done, by default you will get the output in the response of your inference request