openai/whisper-medium

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labeled data, it generalizes well to many datasets and domains without fine-tuning. The model is based on a Transformer encoder-decoder architecture.

Due to low usage, this model has been replaced by openai/whisper-large-v3. Inference requests that name openai/whisper-medium still work, but they are redirected. Please update your code to use another model.

OpenAI Speech-to-Text HTTP/cURL API

You can POST to our endpoints, which are compatible with the OpenAI Transcriptions and Translations APIs.

Create transcription

For a given audio file and model, the endpoint will return the transcription object or a verbose transcription object.

Request body

file (Required): The audio file object to transcribe. Supported formats are flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
model (Required): ID of the model to use; on this page, only openai/whisper-medium. For other models, refer to models/automatic-speech-recognition.
language (Optional): The language of the input audio. Supplying the input language in ISO-639-1 format can improve accuracy and latency.
prompt (Optional): An optional text prompt to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
response_format (Optional): The format of the output. Options include: json (default), text, srt, verbose_json, vtt.
temperature (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model uses log probability to automatically increase the temperature until certain thresholds are hit.
timestamp_granularities[] (Optional): Specifies the timestamp granularity for transcription. Requires response_format to be set to verbose_json. Options: word - generates timestamps for individual words, segment - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
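
The optional fields above are sent as extra form parts alongside file and model. As a minimal sketch, a POSIX shell wrapper could add language only when one is given. The function name `transcribe` and the overridable `CURL` variable are hypothetical helpers for illustration, not part of the API.

```shell
# Hypothetical wrapper around the transcription endpoint documented above.
# usage: transcribe FILE [LANGUAGE] [RESPONSE_FORMAT]
# CURL is an assumption of this sketch: set CURL=echo to print the
# arguments instead of sending the request.
transcribe() {
  file=$1
  language=${2:-}
  format=${3:-json}   # json is the documented default

  # Required form fields, plus the requested output format.
  set -- -F "file=@${file}" \
         -F "model=openai/whisper-medium" \
         -F "response_format=${format}"

  # language is optional; send it only when the caller supplied one.
  if [ -n "$language" ]; then
    set -- "$@" -F "language=${language}"
  fi

  "${CURL:-curl}" "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
    -H "Content-Type: multipart/form-data" \
    -H "Authorization: Bearer ${DEEPINFRA_TOKEN}" \
    "$@"
}
```

With this in place, `transcribe /path/to/file/audio.mp3 en verbose_json` would send the basic request with language and response_format added.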

Response body

The transcription object or a verbose transcription object.

Basic request

curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-medium"

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}

Word timestamp request

curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-medium" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=word"

{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "words": [
    {
      "word": "The",
      "start": 0.0,
      "end": 0.23999999463558197
    },
    ...
    {
      "word": "volleyball",
      "start": 7.400000095367432,
      "end": 7.900000095367432
    }
  ]
}
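
The start and end values in the response above are offsets in seconds. If you want to build caption cues from word timestamps yourself, a small helper can format them as SubRip-style timecodes (HH:MM:SS,mmm). This is only a sketch: `srt_time` is a hypothetical name, and it assumes awk is available.

```shell
# usage: srt_time SECONDS
# Prints SECONDS as HH:MM:SS,mmm, rounded to the nearest millisecond.
srt_time() {
  awk -v t="$1" 'BEGIN {
    ms = int(t * 1000 + 0.5)                 # total milliseconds, rounded
    h = int(ms / 3600000); ms -= h * 3600000 # whole hours
    m = int(ms / 60000);   ms -= m * 60000   # whole minutes
    s = int(ms / 1000);    ms -= s * 1000    # whole seconds; ms keeps the rest
    printf "%02d:%02d:%02d,%03d\n", h, m, s, ms
  }'
}
```

For the last word in the example response, `srt_time 7.400000095367432` prints `00:00:07,400`.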

Segment timestamp request

curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-medium" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=segment"

{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 3.319999933242798,
      "text": " The beach was a popular spot on a hot summer day.",
      "tokens": [
        50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2860786020755768,
      "compression_ratio": 1.2363636493682861,
      "no_speech_prob": 0.00985979475080967
    },
    ...
  ]
}
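
Each segment carries no_speech_prob, which in Whisper's output is the model's estimate that the segment contains no speech. One way a client might use it is to flag segments above a cutoff. The helper name `is_probably_silence` is hypothetical, and the 0.6 default is an assumption borrowed from the default no_speech_threshold of the open-source Whisper reference implementation, not a value documented for this endpoint.

```shell
# usage: is_probably_silence NO_SPEECH_PROB [THRESHOLD]
# Succeeds (exit status 0) when NO_SPEECH_PROB is above THRESHOLD.
# The 0.6 default is an illustrative assumption, not an API-defined value.
is_probably_silence() {
  awk -v p="$1" -v t="${2:-0.6}" 'BEGIN { exit !(p + 0 > t + 0) }'
}
```

The segment shown above has a no_speech_prob of roughly 0.0099, so `is_probably_silence 0.00985979475080967` returns a non-zero status and the segment would be kept.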

Create translation

For a given audio file and model, the endpoint will return the audio's text translated into English.

Request body

file (Required): The audio file object to translate. Supported formats are flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
model (Required): ID of the model to use; on this page, only openai/whisper-medium. For other models, refer to models/automatic-speech-recognition.
prompt (Optional): An optional text prompt to guide the model's style or continue a previous audio segment. The prompt should be in English.
response_format (Optional): The format of the output. Options include: json (default), text, srt, verbose_json, vtt.
temperature (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
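
Both endpoints list the same supported file formats. A client could check the extension to catch obvious mistakes before uploading. This is a sketch that looks only at the file name, not at the file contents, and `is_supported_audio` is a hypothetical helper name.

```shell
# usage: is_supported_audio PATH
# Succeeds when PATH ends in one of the extensions listed for the file field.
is_supported_audio() {
  # Lower-case the name so that AUDIO.MP3 is treated like audio.mp3.
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *.flac|*.mp3|*.mp4|*.mpeg|*.mpga|*.m4a|*.ogg|*.wav|*.webm) return 0 ;;
    *) return 1 ;;
  esac
}
```

A script might call `is_supported_audio "$file" || exit 1` before invoking curl.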

Response body

The text translated into English.

Basic request

curl "https://api.deepinfra.com/v1/openai/audio/translations" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/german.m4a" \
  -F model="openai/whisper-medium"

{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}

Allowed values for response_format: json, verbose_json, text, srt, vtt

`temperature` (number)

The sampling temperature, between 0 and 1. Higher values produce more creative results.

Default value: 0

Range: 0 ≤ temperature ≤ 1
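
The range stated above can be checked before a request is sent. A minimal sketch, with `valid_temperature` as a hypothetical helper name and awk assumed to be available:

```shell
# usage: valid_temperature VALUE
# Succeeds when VALUE is a plain decimal number with 0 <= VALUE <= 1.
valid_temperature() {
  awk -v v="$1" 'BEGIN {
    # Accept forms such as 0, 1, 0.8 or .5; reject signs, exponents and text.
    ok = (v ~ /^[0-9]*\.?[0-9]+$/) && (v + 0 <= 1)
    exit !ok
  }'
}
```

For example, `valid_temperature 0.8` succeeds, while `valid_temperature 1.5` fails.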

`timestamp_granularities` (array)

An array specifying the granularity of timestamps to include in the transcription. Possible values are 'segment', 'word'.
