We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

Browse deepinfra models:

All categories and models you can try out and directly use in deepinfra:

Viewing all

featured

text-generation

automatic-speech-recognition

text-to-speech

embeddings

text-to-video

text-to-image

reranker

zero-shot-image-classification

multimodal

Category/automatic-speech-recognition

Automatic Speech Recognition (ASR) AI models are a critical component of many modern applications, including virtual assistants, dictation software, and transcription services. These models use machine learning techniques to transcribe spoken language into written text, enabling computers to understand and respond to spoken commands.

There are many different types of ASR models, each with its own strengths and weaknesses. Traditional models include hidden Markov models (HMMs) and Gaussian mixture models (GMMs), while more recent models use deep learning techniques such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs), convolutional neural networks (CNNs), and transformers.

While ASR models have made significant progress in recent years, they still face challenges in noisy environments, with multiple speakers, and with accented or non-standard speech. Nevertheless, they are becoming increasingly accurate and versatile, enabling new and exciting applications in areas such as healthcare, education, and entertainment.

mistralai/Voxtral-Small-24B-2507 cover image

featured

bf16

$0.00300 / minute

mistralai/

Voxtral-Small-24B-2507

automatic-speech-recognition

Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

mistralai/Voxtral-Mini-3B-2507 cover image

featured

bf16

$0.00100 / minute

mistralai/

Voxtral-Mini-3B-2507

automatic-speech-recognition

Voxtral Mini is an enhancement of Ministral 3B, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

openai/whisper-large-v3-turbo cover image

featured

$0.00020 / minute

openai/

whisper-large-v3-turbo

automatic-speech-recognition

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it's the exact same model, except that the number of decoding layers have reduced from 32 to 4. As a result, the model is way faster, at the expense of a minor quality degradation.

Replaced

openai/

whisper-base

automatic-speech-recognition

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrates a strong ability to generalize to many datasets and domains without fine-tuning. The model is based on a Transformer encoder-decoder architecture. Whisper models are available for various languages including English, Spanish, French, German, Italian, Portuguese, Russian, Chinese, Japanese, Korean, and many more.

Replaced

openai/

whisper-base.en

automatic-speech-recognition

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrated a strong ability to generalise to many datasets and domains without fine-tuning. Whisper checks pens are available in five configurations of varying model sizes, including a smallest configuration trained on English-only data and a largest configuration trained on multilingual data. This one is English-only.

$0.00045 / minute

openai/

whisper-large-v3

automatic-speech-recognition

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

Replaced

openai/

whisper-medium.en

automatic-speech-recognition

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without fine-tuning. The primary intended users of these models are AI researchers studying robustness, generalisation, and capabilities of the current model.

Replaced

openai/

whisper-small.en

automatic-speech-recognition

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation, trained on 680k hours of labelled data without the need for fine-tuning. It is a Transformer based encoder-decoder model, trained on either English-only or multilingual data, and is available in five configurations of varying model sizes. The models were trained on the tasks of speech recognition and speech translation, predicting transcriptions in the same or different languages as the audio.

Replaced

openai/

whisper-timestamped-medium

automatic-speech-recognition

Whisper is a set of multi-lingual, robust speech recognition models trained by OpenAI that achieve state-of-the-art results in many languages. Whisper models were trained to predict approximate timestamps on speech segments (most of the time with 1-second accuracy), but they cannot originally predict word timestamps. This version has implementation to predict word timestamps and provide a more accurate estimation of speech segments when transcribing with Whisper models.

Replaced

openai/

whisper-timestamped-medium.en

automatic-speech-recognition

Whisper is a set of multi-lingual, robust speech recognition models trained by OpenAI that achieve state-of-the-art results in many languages. Whisper models were trained to predict approximate timestamps on speech segments (most of the time with 1-second accuracy), but they cannot originally predict word timestamps. This variant contains implementation to predict word timestamps and provide a more accurate estimation of speech segments when transcribing with Whisper models.

Replaced

openai/

whisper-tiny.en

automatic-speech-recognition

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation, trained on 680k hours of labeled data without fine-tuning. It's a Transformer based encoder-decoder model, trained on English-only or multilingual data, predicting transcriptions in the same or different language as the audio. Whisper checkpoints come in five configurations of varying model sizes.

Unlock the most affordable AI hosting

Run models at scale with our fully managed GPU infrastructure, delivering enterprise-grade uptime at the industry's best rates.

Contact Sales Get Started

Latest Models

Phind/

Phind-CodeLlama-34B-v2

Gryphe/

MythoMax-L2-13b

bigcode/

starcoder2-15b

openai/

whisper-tiny

openchat/

openchat_3.5

Featured Models

anthropic/

claude-4-sonnet

Qwen/

Qwen3-14B

meta-llama/

Llama-4-Scout-17B-16E-Instruct

google/

gemini-2.5-flash

meta-llama/

Llama-3.3-70B-Instruct-Turbo

Qwen/

Qwen3-Coder-480B-A35B-Instruct

Company

Pricing

Docs

Compare

DeepStart

About

Careers

Trust Center

Privacy

Terms

Have questions or need a custom solution?

Contact Sales