openai/whisper-large-v3-turbo

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3. In other words, it is the same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is significantly faster, at the expense of a minor quality degradation.

Public
$0.00020 / minute
Paper | License

HTTP/cURL API

You can use cURL or any other HTTP client to run inferences:

curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN" \
    -F audio=@my_voice.mp3 \
    'https://api.deepinfra.com/v1/inference/openai/whisper-large-v3-turbo'

which will give you back something similar to:

{
  "text": "Hello World",
  "segments": [
    {
      "id": 0,
      "text": "Hello",
      "start": 0.0,
      "end": 1.0
    },
    {
      "id": 1,
      "text": "World",
      "start": 4.0,
      "end": 5.0
    }
  ],
  "language": "en",
  "input_length_ms": 0,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}
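
The same request in Python, as a minimal sketch using the requests library; the endpoint, token variable, multipart field name, and response shape are taken from the cURL example above, and error handling is kept to a bare raise_for_status:

import os
import requests

API_URL = "https://api.deepinfra.com/v1/inference/openai/whisper-large-v3-turbo"

# Reuse the same token as in the cURL example.
headers = {"Authorization": f"bearer {os.environ['DEEPINFRA_TOKEN']}"}

# Upload the audio as the multipart form field "audio", matching -F audio=@...
with open("my_voice.mp3", "rb") as f:
    resp = requests.post(API_URL, headers=headers, files={"audio": f})
resp.raise_for_status()

result = resp.json()
print(result["text"])

# Each segment carries its own start/end timestamps, in seconds.
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}s - {seg['end']:.2f}s] {seg['text']}")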

Input fields

audio (string)

Audio to transcribe.


task (string)

Task to perform.

Default value: "transcribe"

Allowed values: transcribe, translate


initial_prompt (string)

Optional text to provide as a prompt for the first window.


temperature (number)

Temperature to use for sampling.

Default value: 0


language (string)

Language of the audio; if not set, the detected language is used. Use a two-letter ISO 639-1 code (e.g. en, de, ja).


chunk_level (string)

Chunk level, either 'segment' or 'word'.

Default value: "segment"

Allowed values: segment, word


chunk_length_s (integer)

Chunk length in seconds used to split the audio.

Default value: 30

Range: 1 ≤ chunk_length_s ≤ 30


webhook (file)

The webhook to call when inference is done; by default you will get the output in the response of your inference request.
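
To combine several of these fields in one request, a sketch like the one below should work, assuming the optional fields are sent as ordinary multipart form fields alongside the audio, as the -F flag in the cURL example suggests; the German source language and the prompt text are placeholder values:

import os
import requests

API_URL = "https://api.deepinfra.com/v1/inference/openai/whisper-large-v3-turbo"
headers = {"Authorization": f"bearer {os.environ['DEEPINFRA_TOKEN']}"}

# Optional fields from the list above; values here are illustrative.
form = {
    "task": "translate",                 # translate the speech to English
    "language": "de",                    # placeholder: source audio is German
    "initial_prompt": "Meeting notes.",  # placeholder prompt for the first window
    "temperature": "0",                  # greedy decoding (the default)
    "chunk_level": "word",               # word-level timestamps instead of segments
    "chunk_length_s": "30",              # split the audio into 30-second chunks
}

with open("my_voice.mp3", "rb") as f:
    resp = requests.post(API_URL, headers=headers, files={"audio": f}, data=form)
resp.raise_for_status()
print(resp.json()["text"])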
