openai/
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it's the exact same model, except that the number of decoding layers have reduced from 32 to 4. As a result, the model is way faster, at the expense of a minor quality degradation.
You can use cURL or any other http client to run inferences:
curl -X POST \
-H "Authorization: bearer $DEEPINFRA_TOKEN" \
-F audio=@my_voice.mp3 \
'https://api.deepinfra.com/v1/inference/openai/whisper-large-v3-turbo'
which will give you back something similar to:
{
"text": "",
"segments": [
{
"end": 1.0,
"id": 0,
"start": 0.0,
"text": "Hello"
},
{
"end": 5.0,
"id": 1,
"start": 4.0,
"text": "World"
}
],
"language": "en",
"input_length_ms": 0,
"words": [
{
"end": 1.0,
"start": 0.0,
"text": "Hello"
},
{
"end": 5.0,
"start": 4.0,
"text": "World"
}
],
"duration": 0.0,
"request_id": null,
"inference_status": {
"status": "unknown",
"runtime_ms": 0,
"cost": 0.0,
"tokens_generated": 0,
"tokens_input": 0
}
}
Run models at scale with our fully managed GPU infrastructure, delivering enterprise-grade uptime at the industry's best rates.