Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labeled data, it generalizes to many datasets and domains without fine-tuning. It is a Transformer-based encoder-decoder model, trained on either English-only or multilingual data, that predicts transcriptions in the same language as the audio or in a different one. Whisper checkpoints come in five configurations of varying model sizes.
You can send POST requests to our endpoints, which are compatible with the OpenAI Transcriptions and Translations APIs.
For a given audio file and model, the endpoint will return the transcription object or a verbose transcription object.
file: the audio file to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
model: the model to use, openai/whisper-tiny.en for this case. For other models, refer to models/automatic-speech-recognition.
response_format: the output format, one of json (default), text, srt, verbose_json, vtt.
timestamp_granularities[]: requires response_format to be set to verbose_json. Options: word - generates timestamps for individual words, segment - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
Returns: The transcription object or a verbose transcription object.
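The same request can be issued from Python. The following is a minimal sketch, not an official client: the helper names build_form_fields and transcribe are hypothetical, and it assumes the third-party requests package is installed and that the token is exported as DEEPINFRA_TOKEN, as in the curl examples. It also enforces, on the client side, the rule that timestamp granularities need response_format set to verbose_json.

```python
import os

TRANSCRIPTIONS_URL = "https://api.deepinfra.com/v1/openai/audio/transcriptions"


def build_form_fields(model, response_format="json", timestamp_granularities=()):
    """Return the non-file multipart fields as (name, value) pairs (hypothetical helper)."""
    if timestamp_granularities and response_format != "verbose_json":
        raise ValueError("timestamp_granularities requires response_format='verbose_json'")
    fields = [("model", model), ("response_format", response_format)]
    # Repeat the array-style field once per granularity, like repeated -F flags in curl.
    fields += [("timestamp_granularities[]", g) for g in timestamp_granularities]
    return fields


def transcribe(path, model="openai/whisper-tiny.en", **options):
    """Send the audio file; assumes the `requests` package is available."""
    import requests  # imported lazily so build_form_fields works without it

    headers = {"Authorization": f"Bearer {os.environ['DEEPINFRA_TOKEN']}"}
    with open(path, "rb") as audio:
        # requests builds the multipart body and its boundary itself,
        # so no Content-Type header is set by hand here.
        reply = requests.post(
            TRANSCRIPTIONS_URL,
            headers=headers,
            data=build_form_fields(model, **options),
            files={"file": audio},
            timeout=120,
        )
    reply.raise_for_status()
    return reply.json()  # valid for json and verbose_json; other formats are not parsed here
```

A call such as transcribe("/path/to/file/audio.mp3", response_format="verbose_json", timestamp_granularities=["word"]) would mirror the word-timestamp curl request.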
curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/audio.mp3" \
-F model="openai/whisper-tiny.en"
{
"text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/audio.mp3" \
-F model="openai/whisper-tiny.en" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=word"
{
"task": "transcribe",
"language": "english",
"duration": 8.470000267028809,
"text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
"words": [
{
"word": "The",
"start": 0.0,
"end": 0.23999999463558197
},
...
{
"word": "volleyball",
"start": 7.400000095367432,
"end": 7.900000095367432
}
]
}
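As an illustration of what the word-level timestamps make possible, the sketch below picks out the words spoken inside a time window. The function name words_between is hypothetical, and the sample dictionary contains only the two words visible in the response above; the elided entries are not reconstructed.

```python
def words_between(response, start, end):
    """Return the words whose timestamps fall entirely inside [start, end] seconds."""
    return [
        entry["word"]
        for entry in response.get("words", [])
        if entry["start"] >= start and entry["end"] <= end
    ]


# Only the two word entries shown in the example response are used here.
sample = {
    "words": [
        {"word": "The", "start": 0.0, "end": 0.23999999463558197},
        {"word": "volleyball", "start": 7.400000095367432, "end": 7.900000095367432},
    ]
}

print(words_between(sample, 0.0, 1.0))  # ['The']
```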
curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/audio.mp3" \
-F model="openai/whisper-tiny.en" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=segment"
{
"task": "transcribe",
"language": "english",
"duration": 8.470000267028809,
"text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
"segments": [
{
"id": 0,
"seek": 0,
"start": 0.0,
"end": 3.319999933242798,
"text": " The beach was a popular spot on a hot summer day.",
"tokens": [
50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
],
"temperature": 0.0,
"avg_logprob": -0.2860786020755768,
"compression_ratio": 1.2363636493682861,
"no_speech_prob": 0.00985979475080967
},
...
]
}
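The segment fields can also be post-processed on the client. The sketch below turns segments into SRT-style cues and skips segments that are probably silence. It is only an illustration: the endpoint can already return srt directly through response_format, the function names are hypothetical, and the 0.6 cutoff on no_speech_prob is an assumed value, not one documented for this API.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(response, max_no_speech_prob=0.6):
    """Build SRT text from a verbose_json response, dropping likely-silent segments."""
    cues = []
    for segment in response.get("segments", []):
        # Assumed cutoff: skip segments the model considers likely to contain no speech.
        if segment.get("no_speech_prob", 0.0) > max_no_speech_prob:
            continue
        index = len(cues) + 1  # SRT cue numbers start at 1
        time_range = f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}"
        cues.append(f"{index}\n{time_range}\n{segment['text'].strip()}\n")
    return "\n".join(cues)


# The single segment shown in the example response, reduced to the fields used here.
sample = {
    "segments": [
        {
            "start": 0.0,
            "end": 3.319999933242798,
            "text": " The beach was a popular spot on a hot summer day.",
            "no_speech_prob": 0.00985979475080967,
        }
    ]
}

print(segments_to_srt(sample))
```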
For a given audio file and model, the translations endpoint will return the text translated into English.
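file: the audio file to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
model: the model to use, openai/whisper-tiny.en for this case. For other models, refer to models/automatic-speech-recognition.
response_format: the output format, one of json (default), text, srt, verbose_json, vtt.
Returns: The text translated into English.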
curl "https://api.deepinfra.com/v1/openai/audio/translations" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/german.m4a" \
-F model="openai/whisper-tiny.en"
{
"text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
response_format (string): The format of the output.
Default value: "json"
Allowed values: json, verbose_json, text, srt, vtt
temperature (number): The sampling temperature, between 0 and 1. Higher values produce more creative results.
Default value: 0
Range: 0 ≤ temperature ≤ 1
timestamp_granularities (array): An array specifying the granularity of timestamps to include in the transcription. Possible values are 'segment', 'word'.
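The constraints listed above can be checked on the client before a request is sent. This is a small sketch under that assumption; check_options is a hypothetical helper, and the server remains the authority on what it accepts.

```python
ALLOWED_FORMATS = {"json", "verbose_json", "text", "srt", "vtt"}
ALLOWED_GRANULARITIES = {"segment", "word"}


def check_options(response_format="json", temperature=0, timestamp_granularities=()):
    """Return a list of human-readable problems; an empty list means the options look valid."""
    problems = []
    if response_format not in ALLOWED_FORMATS:
        problems.append(f"unsupported response_format: {response_format!r}")
    if not 0 <= temperature <= 1:
        problems.append(f"temperature out of range [0, 1]: {temperature}")
    unknown = [g for g in timestamp_granularities if g not in ALLOWED_GRANULARITIES]
    if unknown:
        problems.append(f"unknown timestamp granularities: {unknown}")
    return problems


print(check_options())  # []
print(check_options(response_format="xml", temperature=1.5))
```

Returning a list rather than raising on the first error lets a caller report every invalid option at once.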