Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
You can use cURL or any other HTTP client to run inference:
curl -X POST \
-H "Authorization: bearer $DEEPINFRA_TOKEN" \
-F audio=@my_voice.mp3 \
'https://api.deepinfra.com/v1/inference/openai/whisper-large-v3'
This returns a response similar to:
{
  "text": "",
  "segments": [
    {
      "id": 0,
      "text": "Hello",
      "start": 0.0,
      "end": 1.0
    },
    {
      "id": 1,
      "text": "World",
      "start": 4.0,
      "end": 5.0
    }
  ],
  "language": "en",
  "input_length_ms": 0,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}
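As a minimal sketch of how a client might consume this response, the Python snippet below parses the JSON and prints one line per segment with its timestamps. It relies only on the `segments`, `start`, `end`, and `text` fields shown in the example above; the helper name `format_segments` and the abbreviated sample payload are illustrative, not part of the API.

```python
import json


def format_segments(response_text):
    """Return one readable line per entry in the response's "segments" array."""
    data = json.loads(response_text)
    lines = []
    for seg in data.get("segments", []):
        # "start" and "end" are offsets in seconds, as in the example response.
        lines.append(f'[{seg["start"]:.1f}s - {seg["end"]:.1f}s] {seg["text"]}')
    return lines


# Abbreviated copy of the example response (illustrative only).
sample = """
{
  "text": "",
  "segments": [
    {"id": 0, "text": "Hello", "start": 0.0, "end": 1.0},
    {"id": 1, "text": "World", "start": 4.0, "end": 5.0}
  ],
  "language": "en"
}
"""

for line in format_segments(sample):
    print(line)
```

In practice, `response_text` would be the body returned by the inference request rather than a hard-coded string.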
language
string: language that the audio is in; uses the detected language if None; use a two-letter language code (ISO 639-1), e.g. en, de, ja
Allowed values: af, am, ar, as, az, ba, be, bg, bn, bo, br, bs, ca, cs, cy, da, de, el, en, es, et, eu, fa, fi, fo, fr, gl, gu, ha, haw, he, hi, hr, ht, hu, hy, id, is, it, ja, jw, ka, kk, km, kn, ko, la, lb, ln, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, nn, no, oc, pa, pl, ps, pt, ro, ru, sa, sd, si, sk, sl, sn, so, sq, sr, su, sv, sw, ta, te, tg, th, tk, tl, tr, tt, uk, ur, uz, vi, yi, yo, yue, zh
chunk_level
string: chunk level, either 'segment' or 'word'
Default value: "segment"
Allowed values: segment, word
chunk_length_s
integer: length, in seconds, of the chunks the audio is split into
Default value: 30
Range: 1 ≤ chunk_length_s ≤ 30
webhook
file: The webhook to call when inference is done. By default, you get the output in the response to your inference request.
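The constraints on these parameters can be checked on the client before a request is sent. The following Python sketch validates `chunk_level` and `chunk_length_s` against the documented allowed values and range, and collects the optional parameters into a dictionary of string fields. It assumes, without confirmation from this page, that the optional parameters are sent as additional multipart form fields alongside `audio`, as in the cURL example; the function name `build_form_fields` is hypothetical.

```python
ALLOWED_CHUNK_LEVELS = {"segment", "word"}  # documented allowed values


def build_form_fields(language=None, chunk_level="segment", chunk_length_s=30):
    """Validate optional parameters and return them as string form fields.

    Assumption: the service accepts these as extra multipart form fields
    next to the audio file. This is not confirmed by the documentation.
    """
    if chunk_level not in ALLOWED_CHUNK_LEVELS:
        raise ValueError("chunk_level must be 'segment' or 'word'")
    if not isinstance(chunk_length_s, int) or not 1 <= chunk_length_s <= 30:
        raise ValueError("chunk_length_s must be an integer from 1 to 30")

    fields = {
        "chunk_level": chunk_level,
        "chunk_length_s": str(chunk_length_s),
    }
    # Omit the language field entirely so the service falls back to detection.
    if language is not None:
        fields["language"] = language
    return fields


print(build_form_fields(language="de", chunk_level="word", chunk_length_s=10))
```

The language code is passed through unchecked here; the list of allowed values above remains the authority on what the service accepts.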