openai/whisper-large

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

Due to low usage, this model has been replaced by openai/whisper-large-v3. Inference requests to openai/whisper-large still work, but they are redirected to the replacement model. Please update your code to use another model.

OpenAI Speech-to-Text HTTP/cURL API

You can POST to our endpoints, which are compatible with the OpenAI Transcriptions and Translations APIs.

Create transcription

For a given audio file and model, the endpoint will return the transcription object or a verbose transcription object.

Request body

file (Required): The audio file object to transcribe. Supported formats are flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
model (Required): ID of the model to use. Only openai/whisper-large for this case. For other models, refer to models/automatic-speech-recognition.
language (Optional): The language of the input audio. Supplying the input language in ISO-639-1 format can improve accuracy and latency.
prompt (Optional): An optional text prompt to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
response_format (Optional): The format of the output. Options include: json (default), text, srt, verbose_json, vtt.
temperature (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.
timestamp_granularities[] (Optional): Specifies the timestamp granularity for transcription. Requires response_format to be set to verbose_json. Options: word - generates timestamps for individual words, segment - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
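
The constraints in the list above can be checked on the client before a request is sent. The following Python sketch is one way to do that; the function name build_transcription_fields and its defaults are hypothetical and not part of the API. It returns the non-file form fields as (name, value) pairs, in the same shape the cURL -F flags use.

```python
from typing import List, Optional, Sequence, Tuple

ALLOWED_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
ALLOWED_GRANULARITIES = {"word", "segment"}


def build_transcription_fields(
    model: str = "openai/whisper-large",
    language: Optional[str] = None,
    prompt: Optional[str] = None,
    response_format: str = "json",
    temperature: Optional[float] = None,
    timestamp_granularities: Sequence[str] = (),
) -> List[Tuple[str, str]]:
    """Return (name, value) pairs for the non-file form fields (hypothetical helper)."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    unknown = set(timestamp_granularities) - ALLOWED_GRANULARITIES
    if unknown:
        raise ValueError(f"unsupported granularity: {sorted(unknown)}")
    # Timestamps are only returned in the verbose transcription object.
    if timestamp_granularities and response_format != "verbose_json":
        raise ValueError(
            "timestamp_granularities requires response_format='verbose_json'"
        )

    fields = [("model", model), ("response_format", response_format)]
    if language is not None:
        fields.append(("language", language))  # ISO-639-1 code, e.g. "en"
    if prompt is not None:
        fields.append(("prompt", prompt))
    if temperature is not None:
        fields.append(("temperature", str(temperature)))
    # The array field is repeated once per value, as with repeated -F flags.
    for granularity in timestamp_granularities:
        fields.append(("timestamp_granularities[]", granularity))
    return fields
```

The audio file itself is sent as a separate multipart part named file, so it is left out of this helper.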

Response body

The transcription object or a verbose transcription object.

Basic request

curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-large"

{
  "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
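
For callers who prefer Python to cURL, the same basic request can be sketched with the standard library alone. The helper names encode_multipart and transcribe are hypothetical. The sketch assumes the endpoint accepts a file part labelled application/octet-stream and that DEEPINFRA_TOKEN is set in the environment, as in the cURL example.

```python
import json
import os
import urllib.request
import uuid

API_URL = "https://api.deepinfra.com/v1/openai/audio/transcriptions"


def encode_multipart(fields, filename, file_bytes, boundary=None):
    """Encode text fields plus one 'file' part as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields:
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The audio goes in a part named "file", matching -F file=@... in cURL.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, model="openai/whisper-large"):
    """POST an audio file and return the decoded JSON response."""
    with open(path, "rb") as audio:
        body, content_type = encode_multipart(
            [("model", model)], os.path.basename(path), audio.read()
        )
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": content_type,
            "Authorization": f"Bearer {os.environ['DEEPINFRA_TOKEN']}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling transcribe("/path/to/file/audio.mp3")["text"] would then yield the transcription string, provided the request succeeds.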

Word timestamp request

curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-large" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=word"

{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "words": [
    {
      "word": "The",
      "start": 0.0,
      "end": 0.23999999463558197
    },
    ...
    {
      "word": "volleyball",
      "start": 7.400000095367432,
      "end": 7.900000095367432
    }
  ]
}
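
Word timestamps make it possible to locate where a given word is spoken. The Python sketch below assumes a response shaped like the one above; find_word_spans is a hypothetical helper, not part of the API. It returns the start and end time of every occurrence of a word.

```python
def find_word_spans(response, target):
    """Return (start, end) pairs for each case-insensitive match of target."""
    wanted = target.strip().lower()
    spans = []
    for entry in response.get("words", []):
        # Strip surrounding whitespace and punctuation before comparing.
        token = entry["word"].strip().strip(".,!?;:\"'").lower()
        if token == wanted:
            spans.append((entry["start"], entry["end"]))
    return spans


# Abbreviated copy of the verbose transcription object shown above.
response = {
    "words": [
        {"word": "The", "start": 0.0, "end": 0.23999999463558197},
        {"word": "volleyball", "start": 7.400000095367432, "end": 7.900000095367432},
    ]
}
spans = find_word_spans(response, "volleyball")
```

A word that never occurs yields an empty list, so callers can test the result directly.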

Segment timestamp request

curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="openai/whisper-large" \
  -F response_format="verbose_json" \
  -F "timestamp_granularities[]=segment"

{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 3.319999933242798,
      "text": " The beach was a popular spot on a hot summer day.",
      "tokens": [
        50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
      ],
      "temperature": 0.0,
      "avg_logprob": -0.2860786020755768,
      "compression_ratio": 1.2363636493682861,
      "no_speech_prob": 0.00985979475080967
    },
    ...
  ]
}
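
The per-segment fields avg_logprob, compression_ratio, and no_speech_prob can be used to flag segments worth a manual review. The Python sketch below is a heuristic, not documented endpoint behavior: flag_unreliable_segments is a hypothetical helper, and its default thresholds are borrowed from the defaults of the open-source Whisper decoder (0.6, -1.0, and 2.4). Tune them for your own audio.

```python
def flag_unreliable_segments(
    response,
    no_speech_threshold=0.6,
    logprob_threshold=-1.0,
    compression_ratio_threshold=2.4,
):
    """Return (segment id, reasons) for segments that trip any threshold."""
    flagged = []
    for segment in response.get("segments", []):
        reasons = []
        if segment["no_speech_prob"] > no_speech_threshold:
            reasons.append("likely silence")
        if segment["avg_logprob"] < logprob_threshold:
            reasons.append("low confidence")
        if segment["compression_ratio"] > compression_ratio_threshold:
            reasons.append("repetitive text")
        if reasons:
            flagged.append((segment["id"], reasons))
    return flagged
```

With these defaults, the example segment above (no_speech_prob of about 0.0099, avg_logprob of about -0.29, compression_ratio of about 1.24) trips none of the three checks.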

Create translation

For a given audio file and model, the endpoint will return the audio's text translated into English.

Request body

file (Required): The audio file object to translate. Supported formats are flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm.
model (Required): ID of the model to use. Only openai/whisper-large for this case. For other models, refer to models/automatic-speech-recognition.
prompt (Optional): An optional text prompt to guide the model's style or continue a previous audio segment. The prompt should be in English.
response_format (Optional): The format of the output. Options include: json (default), text, srt, verbose_json, vtt.
temperature (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

Response body

The text translated into English.

Basic request

curl "https://api.deepinfra.com/v1/openai/audio/translations" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F file="@/path/to/file/german.m4a" \
  -F model="openai/whisper-large"

{
  "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
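
How the response body should be read depends on response_format. The Python sketch below assumes that json and verbose_json return a JSON object with a text field, as in the example above, and that text, srt, and vtt return a plain-text body, as in the OpenAI API. The second assumption is not confirmed on this page. read_translation is a hypothetical helper.

```python
import json


def read_translation(body: str, response_format: str = "json") -> str:
    """Return the translated text (or subtitle file) from a raw response body."""
    if response_format in ("json", "verbose_json"):
        # JSON formats wrap the translation in a "text" field.
        return json.loads(body)["text"]
    if response_format in ("text", "srt", "vtt"):
        # Assumed to be returned as-is, without a JSON wrapper.
        return body
    raise ValueError(f"unsupported response_format: {response_format!r}")
```

For srt and vtt the returned string is the whole subtitle file, including cue numbers and timestamps.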

Input fields

`response_format` (string)

Allowed values: json, verbose_json, text, srt, vtt

`temperature` (number)

The sampling temperature, between 0 and 1. Higher values make the output more random; lower values make it more focused and deterministic.

Default value: 0

Range: 0 ≤ temperature ≤ 1

`timestamp_granularities` (array)

An array specifying the granularity of timestamps to include in the transcription. Possible values are 'segment' and 'word'.
