openai/whisper-base cover image

openai/whisper-base

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrates a strong ability to generalize to many datasets and domains without fine-tuning. The model is based on a Transformer encoder-decoder architecture. Whisper models are available for various languages including English, Spanish, French, German, Italian, Portuguese, Russian, Chinese, Japanese, Korean, and many more.

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrates a strong ability to generalize to many datasets and domains without fine-tuning. The model is based on a Transformer encoder-decoder architecture. Whisper models are available for various languages including English, Spanish, French, German, Italian, Portuguese, Russian, Chinese, Japanese, Korean, and many more.

Public

HTTP/cURL API

You can use cURL or any other http client to run inferences:

curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -F audio=@my_voice.mp3  \
    'https://api.deepinfra.com/v1/inference/openai/whisper-base'

which will give you back something similar to:

{
  "text": "",
  "segments": [
    {
      "id": 0,
      "text": "Hello",
      "start": 0.0,
      "end": 1.0
    },
    {
      "id": 1,
      "text": "World",
      "start": 4.0,
      "end": 5.0
    }
  ],
  "language": "en",
  "input_length_ms": 0,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}

Input fields

audiostring

audio to transcribe


taskstring

task to perform

Default value: "transcribe"

Allowed values: transcribetranslate


initial_promptstring

optional text to provide as a prompt for the first window.


temperaturenumber

temperature to use for sampling

Default value: 0


languagestring

language that the audio is in; uses detected language if None


webhookfile

The webhook to call when inference is done, by default you will get the output in the response of your inference request

Input Schema

Output Schema