openai/whisper-small

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrates a strong ability to generalize to many datasets and domains without the need for fine-tuning. The model is based on a Transformer architecture and uses a large-scale weak supervision technique.

OpenAI Speech-to-Text HTTP/cURL API

You can POST to our OpenAI-compatible Transcriptions and Translations endpoints.
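
Because the endpoints follow the OpenAI API shape, you can also call them from the official OpenAI Python SDK instead of cURL. A minimal sketch, assuming the base URL https://api.deepinfra.com/v1 (inferred from the cURL examples below) and a DEEPINFRA_TOKEN environment variable:

import os
from openai import OpenAI

# Point the OpenAI SDK at the DeepInfra endpoint.
# base_url is an assumption inferred from the cURL examples below.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1",
)

with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-small",
        file=audio_file,
    )

print(transcription.text)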

Create transcription

For a given audio file and model, the endpoint will return the transcription object or a verbose transcription object.

Request body

  • file (Required): The audio file object to transcribe. Supported formats are flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
  • model (Required): ID of the model to use. Only openai/whisper-small is supported here; for other models, refer to models/automatic-speech-recognition.
  • language (Optional): The language of the input audio. Supplying the input language in ISO-639-1 format can improve accuracy and reduce latency.
  • prompt (Optional): An optional text prompt to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
  • response_format (Optional): The format of the output. Options include: json (default), text, srt, verbose_json, vtt.
  • temperature (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
  • timestamp_granularities[] (Optional): Specifies the timestamp granularity for transcription. Requires response_format to be set to verbose_json. Options: word - generates timestamps for individual words, segment - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

Response body

The transcription object or a verbose transcription object.

Basic request

curl "https://api.deepinfra.com/v1/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-small"
{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
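
The same call can be made from Python with the requests library. A sketch mirroring the cURL request above, with a few of the optional fields from the request body included for illustration:

import os
import requests

with open("/path/to/file/audio.mp3", "rb") as audio_file:
    response = requests.post(
        "https://api.deepinfra.com/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {os.environ['DEEPINFRA_TOKEN']}"},
        # requests sets the multipart/form-data Content-Type automatically
        files={"file": audio_file},
        data={
            "model": "openai/whisper-small",
            "language": "en",           # optional, ISO-639-1
            "response_format": "json",  # optional, the default
            "temperature": "0",         # optional, the default
        },
    )

response.raise_for_status()
print(response.json()["text"])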

Word timestamp request

curl "https://api.deepinfra.com/v1/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-small" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=word"
{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "words": [
    {
      "word": "The",
      "start": 0.0,
      "end": 0.23999999463558197
    },
    ...
    {
      "word": "volleyball",
      "start": 7.400000095367432,
      "end": 7.900000095367432
    }
  ]
}
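
The words array can then be consumed directly. For example, a small sketch that prints each word with its timing, based on the response shape shown above:

def print_word_timings(result: dict) -> None:
    """Print each word of a verbose_json response with its start/end time."""
    for w in result["words"]:
        print(f"{w['start']:6.2f}s - {w['end']:6.2f}s  {w['word']}")

# e.g. print_word_timings(response.json()) prints lines like:
#   0.00s -   0.24s  The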

Segment timestamp request

curl "https://api.deepinfra.com/v1/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-small" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=segment"
{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 3.319999933242798,
      "text": " The beach was a popular spot on a hot summer day.",
      "tokens": [
        50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2860786020755768,
      "compression_ratio": 1.2363636493682861,
      "no_speech_prob": 0.00985979475080967
    },
    ...
  ]
}
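
If you want subtitle output but prefer to post-process the segments yourself (rather than requesting response_format=srt), the start, end, and text fields shown above are enough to build SRT cues. A sketch:

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:00:03,320."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list) -> str:
    """Build an SRT document from verbose_json segments."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        cues.append(f"{i}\n{srt_timestamp(seg['start'])} --> "
                    f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(cues)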

Create translation

For a given audio file and model, the endpoint will return the text translated into English.

Request body

  • file (Required): The audio file object to translate. Supported formats are flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
  • model (Required): ID of the model to use. Only openai/whisper-small is supported here; for other models, refer to models/automatic-speech-recognition.
  • prompt (Optional): An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.
  • response_format (Optional): The format of the output. Options include: json (default), text, srt, verbose_json, vtt.
  • temperature (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

Response body

The text translated into English.

Basic request

curl "https://api.deepinfra.com/v1/audio/translations" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/german.m4a" \
  -F model="openai/whisper-small"
{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
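
As with transcription, the translation call can be issued from Python. A minimal requests sketch mirroring the cURL example above:

import os
import requests

with open("/path/to/file/german.m4a", "rb") as audio_file:
    response = requests.post(
        "https://api.deepinfra.com/v1/audio/translations",
        headers={"Authorization": f"Bearer {os.environ['DEEPINFRA_TOKEN']}"},
        files={"file": audio_file},
        data={"model": "openai/whisper-small"},
    )

response.raise_for_status()
print(response.json()["text"])  # English translation of the audio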

Input fields

model (string)

Model name to use.


language (string)

The language of the input audio.


prompt (string)

An optional text to guide the model's style or continue a previous audio segment.


response_format (string)

The format of the output.

Default value: "json"

Allowed values: json, verbose_json, text, srt, vtt


temperature (number)

The sampling temperature, between 0 and 1. Higher values produce more creative results.

Default value: 0

Range: 0 ≤ temperature ≤ 1


timestamp_granularities (array)

An array specifying the granularity of timestamps to include in the transcription. Possible values are 'segment' and 'word'.
