openai/whisper-small.en

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models generalise to many datasets and domains without the need for fine-tuning. It is a Transformer-based encoder-decoder model, trained on either English-only or multilingual data, and is available in five configurations of varying model sizes. The models were trained on the tasks of speech recognition and speech translation, predicting transcriptions either in the same language as the audio or in a different one.

Public
$0.0005 / sec

HTTP/cURL API

You can use cURL or any other HTTP client to run inference:

curl -X POST \
    -H "Authorization: bearer $(deepctl auth token)"  \
    -F audio=@my_voice.mp3  \
    'https://api.deepinfra.com/v1/inference/openai/whisper-small.en'

which will give you back something similar to:

{
  "text": "",
  "segments": [
    {
      "id": 0,
      "text": "Hello",
      "start": 0.0,
      "end": 1.0
    },
    {
      "id": 1,
      "text": "World",
      "start": 4.0,
      "end": 5.0
    }
  ],
  "language": "en",
  "input_length_ms": 0,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}
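A response of this shape can be consumed with a few lines of Python. The following is a minimal sketch that assumes only the fields shown in the sample above; the helper names format_segments and full_text are illustrative and not part of the API. Because the sample shows an empty top-level "text", the sketch falls back to joining the segment texts when that field is blank.

```python
import json

# Sample payload copied from the example response (trimmed to the fields used here).
SAMPLE = """
{
  "text": "",
  "segments": [
    {"id": 0, "text": "Hello", "start": 0.0, "end": 1.0},
    {"id": 1, "text": "World", "start": 4.0, "end": 5.0}
  ],
  "language": "en"
}
"""


def format_segments(response):
    """Return one human-readable line per segment, with start and end times in seconds."""
    lines = []
    for seg in response.get("segments", []):
        lines.append(f"[{seg['start']:.1f}s -> {seg['end']:.1f}s] {seg['text']}")
    return lines


def full_text(response):
    """Return the top-level transcript, or rebuild it from the segments if it is blank."""
    text = (response.get("text") or "").strip()
    if text:
        return text
    return " ".join(seg["text"].strip() for seg in response.get("segments", []))


response = json.loads(SAMPLE)
for line in format_segments(response):
    print(line)
print(full_text(response))
```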

Input fields

audio (string)

audio to transcribe


task (string)

task to perform

Default value: transcribe

Allowed values: transcribe, translate


language (string)

language that the audio is in; uses detected language if None


temperature (number)

temperature to use for sampling

Default value: 0


patience (number)

patience value to use in beam decoding

Default value: 1


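suppress_tokens (string)

comma-separated list of token ids to suppress during sampling; in the reference Whisper implementation, -1 suppresses most special characters except common punctuation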

Default value: -1


initial_prompt (string)

optional text to provide as a prompt for the first window.


condition_on_previous_text (boolean)

provide the previous output of the model as a prompt for the next window

Default value: true


temperature_increment_on_fallback (number)

amount by which to raise the temperature on each fallback, when decoding fails to meet either of the thresholds below

Default value: 0.2


compression_ratio_threshold (number)

gzip compression ratio threshold; decoding is treated as failed if the ratio of the output text is higher than this value

Default value: 2.4
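To make the compression ratio concrete: the open-source Whisper reference code computes it as the length of the UTF-8 encoded text divided by the length of its zlib-compressed form, so highly repetitive output (a common failure mode) scores high. The sketch below reproduces that calculation; that the hosted endpoint computes it the same way is an assumption.

```python
import zlib


def compression_ratio(text):
    """Ratio of raw UTF-8 size to zlib-compressed size; repetitive text scores higher."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


# A looping transcript compresses very well, so its ratio exceeds the 2.4 default.
repetitive = "thank you " * 40
# A short, varied sentence barely compresses, so its ratio stays low.
varied = "The quick brown fox jumps over the lazy dog near the river bank."

print(compression_ratio(repetitive) > 2.4)
print(compression_ratio(varied) > 2.4)
```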


logprob_threshold (number)

average log probability threshold; decoding is treated as failed if the average log probability is lower than this value

Default value: -1


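no_speech_threshold (number)

threshold on the probability of the <|nospeech|> token; in the reference Whisper implementation, a segment is treated as silence if this probability is higher than the threshold and decoding has also failed the logprob_threshold check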

Default value: 0.6
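The interaction between temperature, temperature_increment_on_fallback and the three thresholds can be sketched as follows. This follows the logic of the open-source Whisper reference implementation; whether the hosted endpoint behaves identically is an assumption, and the function names are illustrative only.

```python
def temperature_schedule(temperature=0.0, increment=0.2):
    """Temperatures tried in order: start value, then +increment each time, up to 1.0."""
    if not increment or increment <= 0:
        return [temperature]
    temps = []
    step = 0
    while True:
        # Round to avoid floating-point drift such as 0.6000000000000001.
        t = round(temperature + step * increment, 10)
        if t > 1.0:
            break
        temps.append(t)
        step += 1
    return temps


def needs_fallback(compression_ratio, avg_logprob,
                   compression_ratio_threshold=2.4, logprob_threshold=-1.0):
    """A decode is retried at the next temperature if either threshold is missed."""
    too_repetitive = compression_ratio > compression_ratio_threshold
    too_uncertain = avg_logprob < logprob_threshold
    return too_repetitive or too_uncertain


def is_silence(no_speech_prob, avg_logprob,
               no_speech_threshold=0.6, logprob_threshold=-1.0):
    """A segment is skipped as silence only if both conditions hold."""
    return no_speech_prob > no_speech_threshold and avg_logprob <= logprob_threshold


print(temperature_schedule())
print(needs_fallback(compression_ratio=3.0, avg_logprob=-0.5))
print(is_silence(no_speech_prob=0.9, avg_logprob=-0.2))
```

With the defaults, decoding starts greedy at temperature 0 and only moves to sampling at higher temperatures when a result looks repetitive or low-confidence.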


webhook (file)

The webhook to call when inference is done. By default, the output is returned in the response to your inference request.
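The same request can be built from Python using only the standard library. The sketch below mirrors the cURL call (a multipart form with an audio file part and a bearer token) and assumes that the optional input fields are sent as additional plain form fields, which this page does not state explicitly. The helper name build_request and the "true"/"false" encoding of booleans are likewise assumptions.

```python
import json
import urllib.request
import uuid

API_URL = "https://api.deepinfra.com/v1/inference/openai/whisper-small.en"


def build_request(audio_bytes, filename, token, **fields):
    """Build (but do not send) a multipart/form-data POST for the inference endpoint."""
    boundary = uuid.uuid4().hex
    parts = []
    # Optional input fields, e.g. task="transcribe" or temperature=0.
    # Assumption: they are accepted as ordinary form fields next to the audio part.
    for name, value in fields.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The file part corresponds to curl's `-F audio=@my_voice.mp3`.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="audio"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def send(request):
    """Send a prepared request and decode the JSON response (performs network I/O)."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

To use it, read the audio file in binary mode, pass its bytes and your API token to build_request together with any optional fields, and hand the result to send.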
