Whisper is a set of multilingual, robust speech recognition models trained by OpenAI that achieve state-of-the-art results in many languages. Whisper models were trained to predict approximate timestamps on speech segments (most of the time with 1-second accuracy), but they cannot natively predict word timestamps. This version adds an implementation that predicts word timestamps and provides a more accurate estimation of speech segments when transcribing with Whisper models.
You can POST to our endpoints, which are compatible with the OpenAI Transcriptions and Translations APIs.
For a given audio file and model, the endpoint will return the transcription object or a verbose transcription object.
file: the audio file to transcribe, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
model: use openai/whisper-timestamped-medium for this case. For other models, refer to models/automatic-speech-recognition.json
response_format: json (default), text, srt, verbose_json, vtt.
timestamp_granularities[]: needs response_format to be set to verbose_json. Options: word - generates timestamps for individual words, segment - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
Returns: The transcription object or a verbose transcription object.
Default JSON output:

curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-timestamped-medium"

Word-level timestamps:

curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-timestamped-medium" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=word"

Segment-level timestamps:

curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-timestamped-medium" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=segment"
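The constraints described for the transcription request (supported file formats, and timestamp granularities only together with verbose_json) can be checked locally before any upload. The following is a minimal sketch, not part of the API: the function name `transcription_fields` is hypothetical, and it assumes the format can be judged from a lowercase file extension, which the endpoint itself may not do.

```shell
# Hypothetical helper (not provided by the API): prints the form fields
# for a transcription request, one per line, or fails with a message.
# usage: transcription_fields FILE [RESPONSE_FORMAT] [GRANULARITY]
transcription_fields() {
  file=$1
  format=${2:-json}
  granularity=${3:-}

  # Assumption: the extension reflects the audio format (lowercase only).
  case "${file##*.}" in
    flac|mp3|mp4|mpeg|mpga|m4a|ogg|wav|webm) ;;
    *) echo "unsupported audio format: $file" >&2; return 1 ;;
  esac

  # Timestamp granularities are only meaningful with verbose_json.
  if [ -n "$granularity" ] && [ "$format" != "verbose_json" ]; then
    echo "timestamp_granularities[] needs response_format=verbose_json" >&2
    return 1
  fi

  printf '%s\n' "file=@$file" \
    "model=openai/whisper-timestamped-medium" \
    "response_format=$format"
  if [ -n "$granularity" ]; then
    printf '%s\n' "timestamp_granularities[]=$granularity"
  fi
}
```

Each printed line corresponds to one `-F` argument of the curl commands above, so a failed check stops the request before the file is sent.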
For a given audio file and model, the endpoint will return the text translated into English.
file: the audio file to translate, in one of these formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
model: use openai/whisper-timestamped-medium for this case. For other models, refer to models/automatic-speech-recognition.json
response_format: json (default), text, srt, verbose_json, vtt.
Returns: The text translated into English.
curl "https://api.deepinfra.com/v1/openai/audio/translations" \
-H "Content-Type: multipart/form-data" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-F file="@/path/to/file/german.m4a" \
-F model="openai/whisper-timestamped-medium"
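Transcription and translation differ only in the last path segment of the URL, so a script that does both can select the endpoint by task. This is a small sketch under that observation; the function name `audio_endpoint` and the task names `transcribe` and `translate` are hypothetical labels, not API values.

```shell
# Hypothetical helper: maps a task name to the endpoint URL used in the
# curl examples of this page.
audio_endpoint() {
  case "$1" in
    transcribe) echo "https://api.deepinfra.com/v1/openai/audio/transcriptions" ;;
    translate)  echo "https://api.deepinfra.com/v1/openai/audio/translations" ;;
    *) echo "unknown task: $1" >&2; return 1 ;;
  esac
}

# Example use (not run here): request English subtitles in srt format
# and save them to a file with curl's -o option.
#
#   curl "$(audio_endpoint translate)" \
#     -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
#     -F file="@/path/to/file/german.m4a" \
#     -F model="openai/whisper-timestamped-medium" \
#     -F response_format="srt" \
#     -o german.en.srt
```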
response_format
string. The format of the output.
Default value: "json"
Allowed values: json, verbose_json, text, srt, vtt
temperature
number. The sampling temperature, between 0 and 1. Higher values produce more creative results.
Default value: 0
Range: 0 ≤ temperature ≤ 1
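Since the temperature must stay within 0 and 1, a script can reject out-of-range values before calling the endpoint. The sketch below is illustrative only: `valid_temperature` is a hypothetical name, and passing the value as a form field named `temperature` alongside the other `-F` fields is an assumption based on the field list, not something the curl examples on this page show.

```shell
# Hypothetical helper: succeeds only if $1 is a plain decimal in [0, 1].
valid_temperature() {
  # Reject empty input, any character other than digits and dots,
  # more than one dot, and a lone dot.
  case "$1" in
    ''|*[!0-9.]*|*.*.*|.) return 1 ;;
  esac
  # Range check in awk, because POSIX shell has no float arithmetic.
  awk -v t="$1" 'BEGIN { exit !(t + 0 >= 0 && t + 0 <= 1) }'
}

# Example use (not run here; the temperature form field is an assumption):
#
#   T=0.2
#   valid_temperature "$T" && curl ... -F temperature="$T"
```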
timestamp_granularities
array. An array specifying the granularity of timestamps to include in the transcription. Possible values are 'segment', 'word'.
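Because the field is an array, more than one granularity can in principle be requested at once. The sketch below assumes that repeating the `timestamp_granularities[]` form field once per value builds that array, as the `[]` suffix in the curl examples suggests; the function name `granularity_fields` is hypothetical.

```shell
# Hypothetical helper: prints one form field per requested granularity.
# All values are validated first, so nothing is printed on bad input.
granularity_fields() {
  for g in "$@"; do
    case "$g" in
      word|segment) ;;
      *) echo "unknown granularity: $g" >&2; return 1 ;;
    esac
  done
  for g in "$@"; do
    printf 'timestamp_granularities[]=%s\n' "$g"
  done
}

# Example use (not run here): each printed line becomes one -F argument,
# together with -F response_format="verbose_json".
#
#   granularity_fields word segment
```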