openai/whisper-tiny


Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and generalizes well to many datasets and domains without fine-tuning. Whisper is a Transformer-based encoder-decoder model trained on either English-only or multilingual data. The English-only models were trained on speech recognition, while the multilingual models were trained on both speech recognition and speech translation.



You can use cURL or any other HTTP client to run inference:

curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -F audio=@my_voice.mp3  \

which will give you back something similar to:

{
  "text": "",
  "segments": [
    {
      "id": 0,
      "text": "Hello",
      "start": 0.0,
      "end": 1.0
    },
    {
      "id": 1,
      "text": "World",
      "start": 4.0,
      "end": 5.0
    }
  ],
  "language": "en",
  "input_length_ms": 0,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}
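
As a rough illustration of how a response like the one above might be consumed, the following Python sketch prints one line per segment and falls back to joining segment texts when the top-level "text" field is empty, as it is in the sample. It relies only on the fields shown in the sample output; the helper names format_segments and full_text are hypothetical and not part of any API.

```python
import json


def format_segments(response: dict) -> list[str]:
    """Render each segment as '[start - end] text' using the sample's fields."""
    lines = []
    for seg in response.get("segments", []):
        lines.append(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
    return lines


def full_text(response: dict) -> str:
    """Return the top-level text, or join the segment texts if it is empty."""
    text = (response.get("text") or "").strip()
    if text:
        return text
    return " ".join(seg["text"].strip() for seg in response.get("segments", []))


# A trimmed copy of the sample response shown above.
sample = json.loads("""
{
  "text": "",
  "segments": [
    {"id": 0, "text": "Hello", "start": 0.0, "end": 1.0},
    {"id": 1, "text": "World", "start": 4.0, "end": 5.0}
  ],
  "language": "en",
  "request_id": null
}
""")

for line in format_segments(sample):
    print(line)
print(full_text(sample))
```

The start and end values are treated here as seconds, which is an assumption based on the sample rather than something the page states.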

Input fields


audio to transcribe


task to perform

Default value: "transcribe"

Allowed values: transcribe, translate


optional text to provide as a prompt for the first window.


temperature to use for sampling

Default value: 0


language that the audio is in; uses the detected language if None; use a two-letter language code (ISO 639-1), e.g. en, de, ja


chunk level, either 'segment' or 'word'

Default value: "segment"

Allowed values: segment, word


chunk length in seconds to split audio

Default value: 30

Range: 1 ≤ chunk_length_s ≤ 30


The webhook to call when inference is done. By default, the output is returned in the response to your inference request.
