Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it is the same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is much faster, at the expense of a minor quality degradation.
You can POST to our endpoints, which are compatible with the OpenAI Transcriptions and Translations APIs.
For a given audio file and model, the endpoint will return the transcription object or a verbose transcription object.
file: The audio file to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
model: openai/whisper-large-v3-turbo for this case. For other models, refer to models/automatic-speech-recognition.json
response_format: json (default), text, srt, verbose_json, vtt.
timestamp_granularities[]: Requires response_format to be set to verbose_json. Options: word - generates timestamps for individual words, segment - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
Returns: The transcription object or a verbose transcription object.
curl "https://api.deepinfra.com/v1/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/audio.mp3" \
-F model="openai/whisper-large-v3-turbo"
{
"text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
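The same request can be issued from Python. The following is a minimal sketch, assuming the third-party requests library is installed and the token is stored in the DEEPINFRA_TOKEN environment variable. The helper names build_form_fields and transcribe are hypothetical and not part of any SDK.

```python
import os

API_URL = "https://api.deepinfra.com/v1/audio/transcriptions"


def build_form_fields(model, response_format="json", granularities=()):
    """Return the non-file form fields as a list of (name, value) pairs."""
    if granularities and response_format != "verbose_json":
        # Timestamp granularities need response_format to be verbose_json.
        raise ValueError(
            "timestamp_granularities requires response_format='verbose_json'"
        )
    fields = [("model", model), ("response_format", response_format)]
    # Repeat the array-style field once per value, like repeated curl -F flags.
    fields += [("timestamp_granularities[]", g) for g in granularities]
    return fields


def transcribe(path, model="openai/whisper-large-v3-turbo", **options):
    """POST an audio file and return the decoded JSON response.

    Assumes response_format is json or verbose_json; other formats are
    assumed to come back as plain text and would need response.text instead.
    """
    import requests  # third-party; imported lazily so the helper above needs no dependencies

    fields = build_form_fields(model, **options)
    headers = {"Authorization": f"Bearer {os.environ['DEEPINFRA_TOKEN']}"}
    with open(path, "rb") as audio:
        # requests generates the multipart Content-Type header (with boundary) itself.
        response = requests.post(
            API_URL, headers=headers, data=fields, files={"file": audio}, timeout=120
        )
    response.raise_for_status()
    return response.json()
```

Calling transcribe("/path/to/file/audio.mp3") mirrors the curl command above. Passing response_format="verbose_json" and granularities=["word"] would request word timestamps.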
curl "https://api.deepinfra.com/v1/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/audio.mp3" \
-F model="openai/whisper-large-v3-turbo" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=word"
{
"task": "transcribe",
"language": "english",
"duration": 8.470000267028809,
"text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
"words": [
{
"word": "The",
"start": 0.0,
"end": 0.23999999463558197
},
...
{
"word": "volleyball",
"start": 7.400000095367432,
"end": 7.900000095367432
}
]
}
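Once a verbose response with word timestamps has been decoded, the words list can be queried directly. The sketch below is illustrative only: the helper names words_between and word_durations are hypothetical, and the sample data contains just the two entries shown in the response above, not the elided ones.

```python
def words_between(words, start, end):
    """Return the words whose time span overlaps the window [start, end)."""
    return [w["word"] for w in words if w["start"] < end and w["end"] > start]


def word_durations(words):
    """Return (word, duration in seconds rounded to milliseconds) pairs."""
    return [(w["word"], round(w["end"] - w["start"], 3)) for w in words]


# Only the two word entries visible in the sample response.
sample = {
    "words": [
        {"word": "The", "start": 0.0, "end": 0.23999999463558197},
        {"word": "volleyball", "start": 7.400000095367432, "end": 7.900000095367432},
    ]
}

print(words_between(sample["words"], 7.0, 8.0))  # ['volleyball']
print(word_durations(sample["words"]))
```

Rounding the durations hides the float32 noise visible in the raw timestamps, such as 0.23999999463558197 for a value that is effectively 0.24.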
curl "https://api.deepinfra.com/v1/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/audio.mp3" \
-F model="openai/whisper-large-v3-turbo" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=segment"
{
"task": "transcribe",
"language": "english",
"duration": 8.470000267028809,
"text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
"segments": [
{
"id": 0,
"seek": 0,
"start": 0.0,
"end": 3.319999933242798,
"text": " The beach was a popular spot on a hot summer day.",
"tokens": [
50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
],
"temperature": 0.0,
"avg_logprob": -0.2860786020755768,
"compression_ratio": 1.2363636493682861,
"no_speech_prob": 0.00985979475080967
},
...
]
}
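Segment timestamps map naturally onto subtitle cues. The endpoint can return srt directly through response_format, so the following sketch is only useful when a verbose_json response is already in hand. The helper names and the max_no_speech_prob threshold are assumptions made for illustration, not part of the API.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments, max_no_speech_prob=1.0):
    """Render segments as SRT cues, skipping those above the no-speech threshold."""
    cues = []
    for seg in segments:
        # Hypothetical filter: drop segments the model considers likely silence.
        if seg.get("no_speech_prob", 0.0) > max_no_speech_prob:
            continue
        index = len(cues) + 1  # SRT cue numbers start at 1
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(cues)
```

Segment text in the sample response begins with a leading space, which is why each cue strips it before rendering.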
For a given audio file and model, the endpoint will return the text translated into English.
file: The audio file to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
model: openai/whisper-large-v3-turbo for this case. For other models, refer to models/automatic-speech-recognition.json
response_format: json (default), text, srt, verbose_json, vtt.
Returns: The text translated into English.
curl "https://api.deepinfra.com/v1/audio/translations" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/german.m4a" \
-F model="openai/whisper-large-v3-turbo"
{
"text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
response_format
string. The format of the output.
Default value: "json"
Allowed values: json, verbose_json, text, srt, vtt
temperature
number. The sampling temperature, between 0 and 1. Higher values make the output more random; lower values make it more deterministic.
Default value: 0
Range: 0 ≤ temperature ≤ 1
timestamp_granularities
array. An array specifying the granularity of timestamps to include in the transcription. Possible values are 'segment', 'word'.