Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data using large-scale weak supervision, and it generalizes well to many datasets and domains without the need for fine-tuning. The model is based on a Transformer architecture.
You can POST to our OpenAI-compatible Transcriptions and Translations endpoints.
For a given audio file and model, the endpoint will return the transcription object or a verbose transcription object.
file: The audio file to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
model: openai/whisper-small for this case. For other models, refer to models/automatic-speech-recognition.json.
response_format: json (default), text, srt, verbose_json, or vtt.
timestamp_granularities[]: Requires response_format to be set to verbose_json. Options: word, which generates timestamps for individual words, and segment, which generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
Returns: The transcription object or a verbose transcription object.
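The constraints above can be checked on the client before uploading a file. The following is a minimal sketch, not part of the DeepInfra API: the helper name `check_transcription_params` and the idea of validating by file extension are assumptions made for illustration.

```python
from pathlib import Path

# Values taken from the parameter descriptions above.
SUPPORTED_EXTENSIONS = {"flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
GRANULARITIES = {"word", "segment"}


def check_transcription_params(path, response_format="json", granularities=()):
    """Return a list of problems found; an empty list means the inputs look valid.

    Hypothetical helper: the extension check is only a heuristic, because a
    file extension may not match the actual audio encoding.
    """
    problems = []
    extension = Path(path).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported audio format: {extension or '(none)'}")
    if response_format not in RESPONSE_FORMATS:
        problems.append(f"unknown response_format: {response_format}")
    unknown = [g for g in granularities if g not in GRANULARITIES]
    if unknown:
        problems.append(f"unknown timestamp granularities: {unknown}")
    # Timestamps are only returned in the verbose response.
    if granularities and response_format != "verbose_json":
        problems.append("timestamp_granularities requires response_format=verbose_json")
    return problems


print(check_transcription_params("/path/to/file/audio.mp3", "json", ["word"]))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue at once.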
curl "https://api.deepinfra.com/v1/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/audio.mp3" \
-F model="openai/whisper-small"
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
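For readers calling the endpoint from Python rather than curl, the same request can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the function name `build_transcription_request` is hypothetical, the file part is labelled `application/octet-stream` as a generic choice, and the request is only built here, not sent, so it has not been verified against the live service.

```python
import os
import uuid
import urllib.request

API_URL = "https://api.deepinfra.com/v1/audio/transcriptions"


def build_transcription_request(audio_bytes, filename, model="openai/whisper-small", token=None):
    """Build (but do not send) a multipart/form-data POST mirroring the curl call."""
    if token is None:
        token = os.environ.get("DEEPINFRA_TOKEN", "")
    boundary = uuid.uuid4().hex
    # First the "model" text field, then the headers of the "file" part.
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    # Closing delimiter that ends the multipart body.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=head + audio_bytes + tail,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Sending the request (requires a valid token and network access):
# import json
# with open("/path/to/file/audio.mp3", "rb") as audio:
#     request = build_transcription_request(audio.read(), "audio.mp3")
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["text"])
```

Building the body by hand keeps the example dependency-free; a third-party HTTP client would normally handle the multipart encoding for you.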
curl "https://api.deepinfra.com/v1/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/audio.mp3" \
-F model="openai/whisper-small" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=word"
{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "words": [
    {
      "word": "The",
      "start": 0.0,
      "end": 0.23999999463558197
    },
    ...
    {
      "word": "volleyball",
      "start": 7.400000095367432,
      "end": 7.900000095367432
    }
  ]
}
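Once a verbose response with word timestamps has been received, the `words` array can be processed like any list of records. The sketch below is illustrative only: the helper names `word_durations` and `words_between` are hypothetical, and the sample data contains just the two words shown in the response above.

```python
import json

# Shortened copy of the verbose response above: only the two words shown there.
response_text = """
{
  "words": [
    {"word": "The", "start": 0.0, "end": 0.23999999463558197},
    {"word": "volleyball", "start": 7.400000095367432, "end": 7.900000095367432}
  ]
}
"""


def word_durations(words):
    """Return (word, duration in seconds) pairs, rounded to two decimals."""
    return [(w["word"], round(w["end"] - w["start"], 2)) for w in words]


def words_between(words, start, end):
    """Return the words whose time span lies fully inside [start, end] seconds."""
    return [w["word"] for w in words if w["start"] >= start and w["end"] <= end]


words = json.loads(response_text)["words"]
print(word_durations(words))
print(words_between(words, 7.0, 8.0))
```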
curl "https://api.deepinfra.com/v1/audio/transcriptions" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/audio.mp3" \
-F model="openai/whisper-small" \
-F response_format="verbose_json" \
-F "timestamp_granularities[]=segment"
{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 3.319999933242798,
      "text": " The beach was a popular spot on a hot summer day.",
      "tokens": [
        50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2860786020755768,
      "compression_ratio": 1.2363636493682861,
      "no_speech_prob": 0.00985979475080967
    },
    ...
  ]
}
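Segment timestamps map naturally onto subtitle cues. The endpoint can return SRT directly through the srt response format, so the following is only a sketch of what one might do with a verbose response that is already in hand. The helper names `srt_timestamp` and `segments_to_srt` are hypothetical, and the sample data is the single segment shown above.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm, the timestamp style used by SRT."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Turn verbose_json segments into SRT cues numbered from 1."""
    cues = []
    for index, segment in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(segment['start'])} --> {srt_timestamp(segment['end'])}\n"
            f"{segment['text'].strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)


# Only the fields needed for subtitles, copied from the segment shown above.
segments = [
    {
        "start": 0.0,
        "end": 3.319999933242798,
        "text": " The beach was a popular spot on a hot summer day.",
    },
]
print(segments_to_srt(segments))
```

Rounding to whole milliseconds before splitting into hours, minutes, and seconds avoids drift from floating-point values such as 3.319999933242798.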
For a given audio file and model, the endpoint will return the text translated into English.
file: The audio file to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.
model: openai/whisper-small for this case. For other models, refer to models/automatic-speech-recognition.json.
response_format: json (default), text, srt, verbose_json, or vtt.
Returns: The text translated into English.
curl "https://api.deepinfra.com/v1/audio/translations" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/german.m4a" \
-F model="openai/whisper-small"
{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
response_format
string
The format of the output.
Default value: "json"
Allowed values: json, verbose_json, text, srt, vtt
temperature
number
The sampling temperature, between 0 and 1. Higher values produce more varied, less deterministic output.
Default value: 0
Range: 0 ≤ temperature ≤ 1
timestamp_granularities
array
An array specifying the granularity of timestamps to include in the transcription. Possible values are 'segment' and 'word'.
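The optional parameters can be turned into form fields before a request is assembled. The sketch below rests on assumptions: the helper name `optional_form_fields` is hypothetical, and sending one `timestamp_granularities[]` field per value is inferred from the bracketed field name in the curl examples rather than confirmed for multiple values.

```python
def optional_form_fields(response_format="json", temperature=0.0, timestamp_granularities=()):
    """Return (name, value) pairs for the optional form fields (hypothetical helper)."""
    # Enforce the documented range before anything is sent.
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    fields = [
        ("response_format", response_format),
        ("temperature", str(temperature)),
    ]
    # Assumption: an array parameter becomes one repeated field per value,
    # using the bracketed name from the curl examples.
    for granularity in timestamp_granularities:
        fields.append(("timestamp_granularities[]", granularity))
    return fields


print(optional_form_fields("verbose_json", 0, ["word", "segment"]))
```

A list of pairs is used instead of a dictionary because a dictionary cannot hold the same field name twice.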