Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it's the same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is significantly faster, at the expense of a minor quality degradation.
You can use cURL or any other HTTP client to run inferences:
curl -X POST \
-H "Authorization: bearer $DEEPINFRA_TOKEN" \
-F audio=@my_voice.mp3 \
'https://api.deepinfra.com/v1/inference/openai/whisper-large-v3-turbo'
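The same request in Python with the requests library; a minimal sketch that assumes DEEPINFRA_TOKEN is exported in your environment and a local file named my_voice.mp3 exists:

import os
import requests

url = "https://api.deepinfra.com/v1/inference/openai/whisper-large-v3-turbo"
token = os.environ["DEEPINFRA_TOKEN"]  # assumption: token exported in the environment

# Upload the audio as a multipart form field, mirroring curl's -F audio=@my_voice.mp3
with open("my_voice.mp3", "rb") as f:
    resp = requests.post(
        url,
        headers={"Authorization": f"bearer {token}"},
        files={"audio": f},
    )
resp.raise_for_status()
print(resp.json()["text"])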
which will give you back something similar to:
{
"text": "",
"segments": [
{
"id": 0,
"text": "Hello",
"start": 0.0,
"end": 1.0
},
{
"id": 1,
"text": "World",
"start": 4.0,
"end": 5.0
}
],
"language": "en",
"input_length_ms": 0,
"request_id": null,
"inference_status": {
"status": "unknown",
"runtime_ms": 0,
"cost": 0.0,
"tokens_generated": 0,
"tokens_input": 0
}
}
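The top-level text field holds the full transcript, while segments carries per-chunk timestamps in seconds. A small sketch of iterating over a parsed response (the dict below mirrors the sample output above; with a live request you would use resp.json() instead):

result = {
    "text": "Hello World",
    "segments": [
        {"id": 0, "text": "Hello", "start": 0.0, "end": 1.0},
        {"id": 1, "text": "World", "start": 4.0, "end": 5.0},
    ],
}

# Print each segment with its start/end timestamps (seconds).
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")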
language
string: language that the audio is in; uses detected language if None; use a two-letter language code (ISO 639-1) (e.g. en, de, ja)
Allowed values: af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, yue, zh
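For example, to skip auto-detection and force German transcription, the parameter can be sent as an extra form field alongside the audio (an assumption based on the multipart interface shown above):

import os
import requests

url = "https://api.deepinfra.com/v1/inference/openai/whisper-large-v3-turbo"
token = os.environ["DEEPINFRA_TOKEN"]

with open("my_voice.mp3", "rb") as f:
    resp = requests.post(
        url,
        headers={"Authorization": f"bearer {token}"},
        files={"audio": f},
        data={"language": "de"},  # force German instead of auto-detection
    )
print(resp.json()["text"])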
chunk_level
string: chunk level, either 'segment' or 'word'
Default value: "segment"
Allowed values: segment, word
chunk_length_s
integer: chunk length in seconds used to split the audio
Default value: 30
Range: 1 ≤ chunk_length_s ≤ 30
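Combining both chunking parameters might look like this (again assuming they travel as form fields); with chunk_level=word, each entry in segments should cover a single word rather than a full segment:

import os
import requests

url = "https://api.deepinfra.com/v1/inference/openai/whisper-large-v3-turbo"
token = os.environ["DEEPINFRA_TOKEN"]

with open("my_voice.mp3", "rb") as f:
    resp = requests.post(
        url,
        headers={"Authorization": f"bearer {token}"},
        files={"audio": f},
        data={"chunk_level": "word", "chunk_length_s": 15},  # word-level timestamps, 15 s chunks
    )
for seg in resp.json()["segments"]:
    print(seg["start"], seg["end"], seg["text"])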
webhook
file: the webhook to call when inference is done; by default you will get the output in the response of your inference request
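To receive the result asynchronously you could point the webhook at your own endpoint. The URL below is a hypothetical placeholder, and passing it as a plain form field is an assumption, so treat this as a sketch rather than a confirmed interface:

import os
import requests

url = "https://api.deepinfra.com/v1/inference/openai/whisper-large-v3-turbo"
token = os.environ["DEEPINFRA_TOKEN"]

with open("my_voice.mp3", "rb") as f:
    requests.post(
        url,
        headers={"Authorization": f"bearer {token}"},
        files={"audio": f},
        data={"webhook": "https://example.com/whisper-callback"},  # hypothetical receiver URL
    )
# The transcription JSON is then delivered to the webhook once inference completes.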