GLM-5.2 is Z-AI's latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a **solid 1M-token context**.

GLM-5.2

Kimi K2.7 Code is a coding-focused agentic model built upon Kimi K2.6. With substantial improvements on real-world long-horizon coding tasks, it strengthens end-to-end task completion across complex software engineering workflows while improving token efficiency, reducing thinking-token usage by approximately 30% compared with Kimi K2.6.

Kimi-K2.7-Code

Nemotron 3 Ultra is built for frontier reasoning, orchestration, coding agents, deep research, and complex enterprise workflows. It delivers up to 5x faster inference and up to 30% lower cost for agentic workloads while supporting up to 1M token context.

NVIDIA-Nemotron-3-Ultra-550B-A55B

DeepSeek V4 Flash is an efficiency-focused MoE model with 284B total parameters (13B active) and a 1M-token context window. It's tuned for fast inference and high-throughput use cases while still holding up on reasoning and coding tasks.

DeepSeek-V4-Flash

DeepSeek V4 Pro is an MoE model with 1.6T total parameters (49B active) and a 1M-token context window. It's built for advanced reasoning, coding, and long-running agent tasks, and performs well on knowledge, math, and software engineering benchmarks.

DeepSeek-V4-Pro

Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.

Kimi-K2.6

MiMo-V2.5-Pro is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters. It uses the hybrid attention architecture and 3-layer Multi-Token Prediction (MTP) introduced in [MiMo-V2-Flash](https://github.com/XiaomiMiMo/MiMo-V2-Flash).

MiMo-V2.5-Pro

Qwen3.6-35B-A3B is Alibaba's latest flagship Mixture-of-Experts model, with 35B total parameters and only 3B activated per token (256 experts, 8 routed + 1 shared). Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

Qwen3.6-35B-A3B

GLM-5.1 is Z-AI's next-generation flagship model for agentic engineering, with significantly stronger coding capabilities than its predecessor. It achieves state-of-the-art performance on SWE-Bench Pro and leads GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).

GLM-5.1

Qwen3.5-397B-A17B is Alibaba's most capable Qwen3.5 model, a Mixture-of-Experts architecture with 397B total parameters and 17B activated per token. It features a 262K token context window (extensible to 1M with YaRN), thinking/reasoning mode, tool calling with MCP integration, and support for 201 languages. It sets state-of-the-art results on reasoning, coding, math, and multimodal benchmarks.

Qwen3.5-397B-A17B

An efficient MoE variant of Gemma 4. Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

gemma-4-26B-A4B-it

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

gemma-4-31B-it

NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for highest compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.

NVIDIA-Nemotron-3-Super-120B-A12B

GLM-5 is an advanced, open-source large language model designed for developers tackling the toughest challenges. It excels at long-context reasoning, multi-step tool orchestration, and complex systems engineering, making it the ideal choice for powering sophisticated agents and applications that require high-level cognitive tasks.

GLM-5

Qwen3-TTS is an advanced text-to-speech model by Alibaba's Qwen team, delivering stable, expressive, and low-latency speech generation across 10 languages.

Key capabilities:

- **9 preset voices**: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee, covering diverse genders, ages, and accents
- **Voice cloning**: clone any voice from a short (~3s) audio sample via the `voice_id` parameter
- **Instruction control**: adjust tone, emotion, and speaking style with natural language (e.g. "speak slowly and calmly", "excited tone")
- **10 languages**: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- **Streaming support**: real-time PCM streaming with ~97ms first-byte latency
- **Multiple output formats**: WAV, MP3, FLAC, PCM

Built on a 1.7B parameter architecture using discrete multi-codebook language modeling for end-to-end speech synthesis without cascading errors. Uses a custom 12Hz acoustic tokenizer that preserves paralinguistic information and environmental audio details.

Qwen3-TTS

Qwen3-TTS-VoiceDesign is a voice design variant of Qwen3-TTS by Alibaba's Qwen team. Instead of selecting from preset voices, you describe the voice you want in natural language, and the model generates speech in that voice.

Key capabilities:

- **Natural language voice control**: describe any voice with free text (e.g. "a deep male voice with a calm, authoritative presence", "a young cheerful female with a warm and friendly tone")
- **10 languages**: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- **Streaming support**: real-time PCM streaming
- **Multiple output formats**: WAV, MP3, FLAC, PCM

Built on the same 1.7B parameter architecture as Qwen3-TTS, using discrete multi-codebook language modeling and a custom 12Hz acoustic tokenizer for high-quality end-to-end speech synthesis.

Qwen3-TTS-VoiceDesign

The latest flagship model in the Qwen family. State-of-the-art results across a comprehensive suite of benchmarks — including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding.

Qwen3-Max

The latest flagship reasoning model in the Qwen3 family, further enhanced by innovations such as adaptive tool use and advanced test-time scaling techniques.

Qwen3-Max-Thinking

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.

Kimi-K2.5

GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.

GLM-4.7-Flash

DeepSeek-V3.2 is a large language model designed to harmonize high computational efficiency with strong reasoning and agentic tool-use performance. It introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that reduces training and inference cost while preserving quality in long-context scenarios. A scalable reinforcement learning post-training framework further improves reasoning, with reported performance in the GPT-5 class, and the model has demonstrated gold-medal results on the 2025 IMO and IOI. V3.2 also uses a large-scale agentic task synthesis pipeline to better integrate reasoning into tool-use settings, boosting compliance and generalization in interactive environments.

DeepSeek-V3.2

The fastest model in the Flux 2 family. Frontier visual intelligence: state-of-the-art image generation and editing from Black Forest Labs.

FLUX-2-klein-4b

The Flux 2 family model with the best quality-to-latency ratio, suited to production apps. Frontier visual intelligence: state-of-the-art image generation and editing from Black Forest Labs.

FLUX-2-klein-9b

Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

You can use cURL or any other HTTP client to run inferences:

```bash
curl -X POST \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -F audio=@my_voice.mp3  \
    'https://api.deepinfra.com/v1/inference/mistralai/Voxtral-Small-24B-2507'
```

which will give you back something similar to:

```json
{
  "text": "",
  "segments": [
    {
      "end": 1.0,
      "id": 0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "id": 1,
      "start": 4.0,
      "text": "World"
    }
  ],
  "language": "en",
  "input_length_ms": 0,
  "words": [
    {
      "end": 1.0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "start": 4.0,
      "text": "World"
    }
  ],
  "duration": 0.0,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0,
    "output_length": 0
  }
}

```
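Once a response like the one above has been parsed, the fields can be read directly. The sketch below is illustrative only: it assumes the response shape shown in the example, and the helper names `joined_text` and `segment_lines` are hypothetical, not part of any DeepInfra client.

```python
import json

# Trimmed copy of the example response shown above (assumed shape).
SAMPLE = """
{
  "text": "",
  "segments": [
    {"end": 1.0, "id": 0, "start": 0.0, "text": "Hello"},
    {"end": 5.0, "id": 1, "start": 4.0, "text": "World"}
  ],
  "language": "en"
}
"""


def joined_text(response: dict) -> str:
    """Return the top-level text, or join segment texts when it is empty."""
    segments = response.get("segments", [])
    return response.get("text") or " ".join(seg["text"] for seg in segments)


def segment_lines(response: dict) -> list:
    """Format each segment as '[start-end] text' with two decimals."""
    return [
        f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text']}"
        for seg in response.get("segments", [])
    ]


resp = json.loads(SAMPLE)
print(joined_text(resp))
for line in segment_lines(resp):
    print(line)
```

Falling back to the segment texts is a design choice for this sketch; whether the top-level `text` can be empty in real responses is not stated here.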


You can use our command-line tool [deepctl](/docs/advanced/deepctl) to run
inferences:

```bash
deepctl infer \
    -m 'mistralai/Voxtral-Small-24B-2507'  \
    -i audio=@my_voice.mp3
```

which will give you back something similar to:

```json
{
  "text": "",
  "segments": [
    {
      "end": 1.0,
      "id": 0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "id": 1,
      "start": 4.0,
      "text": "World"
    }
  ],
  "language": "en",
  "input_length_ms": 0,
  "words": [
    {
      "end": 1.0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "start": 4.0,
      "text": "World"
    }
  ],
  "duration": 0.0,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0,
    "output_length": 0
  }
}

```


We recommend using our NodeJS client https://github.com/deepinfra/deepinfra-node.

You can install it with

```bash
npm install deepinfra
```

and then

```javascript
import { AutomaticSpeechRecognition } from "deepinfra";
import path from "path";
import { fileURLToPath } from 'url';


const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

const DEEPINFRA_API_KEY = "$DEEPINFRA_TOKEN";
const MODEL = "mistralai/Voxtral-Small-24B-2507";

const main = async () => {
  const client = new AutomaticSpeechRecognition(MODEL, DEEPINFRA_API_KEY);

  const input = {
    audio: path.join(__dirname, "audio.mp3"),
  };
  const response = await client.generate(input);
  console.log(response.text);
};

main();
```


You can POST to our OpenAI Transcriptions and Translations compatible endpoint.

# Create transcription

For a given audio file and model, the endpoint will return the **transcription object** or a **verbose transcription object**.

## Request body

- **file** (Required): The audio file object to transcribe. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `mistralai/Voxtral-Small-24B-2507` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **language** (Optional): The language of the input audio. Supplying the input language in ISO-639-1 format can improve accuracy and latency.
- **prompt** (Optional): An optional text prompt to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): Controls the sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model will adjust automatically to increase temperature as needed.
- **timestamp_granularities[]** (Optional): Specifies the timestamp granularity for transcription. Requires `response_format` to be set to `verbose_json`. Options: `word` - generates timestamps for individual words, `segment` - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

## Response body

The transcription object or a verbose transcription object.

### Basic request

```bash
curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
 -H "Content-Type: multipart/form-data" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -F file="@/path/to/file/audio.mp3" \
 -F model="mistralai/Voxtral-Small-24B-2507"
```
```json
{
 "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
```

### Word timestamp request

```bash
curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
 -H "Content-Type: multipart/form-data" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -F file="@/path/to/file/audio.mp3" \
 -F model="mistralai/Voxtral-Small-24B-2507" \
 -F response_format="verbose_json" \
 -F "timestamp_granularities[]=word"
```

```json
{
 "task": "transcribe",
 "language": "english",
 "duration": 8.470000267028809,
 "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
 "words": [
 {
 "word": "The",
 "start": 0.0,
 "end": 0.23999999463558197
 },
 ...
 {
 "word": "volleyball",
 "start": 7.400000095367432,
 "end": 7.900000095367432
 }
 ]
}
```
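The word-level timestamps above can be regrouped on the client side, for example into subtitle cues of a fixed number of words. The following is a minimal sketch that assumes each entry in `words` carries `word`, `start`, and `end` as in the example; `srt_timestamp`, `words_to_srt`, and the `max_words` parameter are hypothetical names chosen for illustration. For standard subtitles, the endpoint's own `srt` response format may be simpler.

```python
def srt_timestamp(seconds: float) -> str:
    """Render a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list, max_words: int = 7) -> str:
    """Group consecutive words into cues of at most max_words words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = srt_timestamp(chunk[0]["start"])
        end = srt_timestamp(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues)


# Two words taken from the example response above.
example_words = [
    {"word": "The", "start": 0.0, "end": 0.23999999463558197},
    {"word": "volleyball", "start": 7.400000095367432, "end": 7.900000095367432},
]
print(words_to_srt(example_words, max_words=1))
```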

### Segment timestamp request

```bash
curl "https://api.deepinfra.com/v1/openai/audio/transcriptions" \
 -H "Content-Type: multipart/form-data" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -F file="@/path/to/file/audio.mp3" \
 -F model="mistralai/Voxtral-Small-24B-2507" \
 -F response_format="verbose_json" \
 -F "timestamp_granularities[]=segment"
```

```json
{
 "task": "transcribe",
 "language": "english",
 "duration": 8.470000267028809,
 "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
 "segments": [
 {
 "id": 0,
 "seek": 0,
 "start": 0.0,
 "end": 3.319999933242798,
 "text": " The beach was a popular spot on a hot summer day.",
 "tokens": [
 50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530
 ],
 "temperature": 0.0,
 "avg_logprob": -0.2860786020755768,
 "compression_ratio": 1.2363636493682861,
 "no_speech_prob": 0.00985979475080967
 },
 ...
 ]
}
```
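The per-segment `avg_logprob` and `no_speech_prob` values can be used to drop segments that are likely silence or low-confidence output. The sketch below is one possible filter, not a documented feature of the endpoint: the function name and both thresholds are assumptions for illustration and should be tuned on your own audio.

```python
def keep_confident_segments(
    segments: list,
    max_no_speech_prob: float = 0.6,   # assumed threshold, not from the API docs
    min_avg_logprob: float = -1.0,     # assumed threshold, not from the API docs
) -> list:
    """Keep segments that look like speech and were decoded with reasonable confidence."""
    kept = []
    for seg in segments:
        looks_like_speech = seg.get("no_speech_prob", 0.0) <= max_no_speech_prob
        confident = seg.get("avg_logprob", 0.0) >= min_avg_logprob
        if looks_like_speech and confident:
            kept.append(seg)
    return kept


segments = [
    # Values copied from the example response above.
    {
        "id": 0,
        "text": " The beach was a popular spot on a hot summer day.",
        "avg_logprob": -0.2860786020755768,
        "no_speech_prob": 0.00985979475080967,
    },
    # Made-up segment that resembles background noise.
    {"id": 1, "text": " ...", "avg_logprob": -1.7, "no_speech_prob": 0.92},
]

for seg in keep_confident_segments(segments):
    print(seg["id"], seg["text"].strip())
```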

# Create translation

For a given audio file and model, the endpoint will return the translated text to English.

## Request body

- **file** (Required): The audio file object to translate. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `mistralai/Voxtral-Small-24B-2507` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **prompt** (Optional): An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

## Response body

The translated text to English.

### Basic request

```bash
curl "https://api.deepinfra.com/v1/openai/audio/translations" \
 -H "Content-Type: multipart/form-data" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -F file="@/path/to/file/german.m4a" \
 -F model="mistralai/Voxtral-Small-24B-2507"
```

```json
{
 "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
```

You can use OpenAI's Python SDK to interact with our OpenAI Transcriptions and Translations compatible endpoint.

# Create transcription

For a given audio file and model, the endpoint will return the **transcription object** or a **verbose transcription object**.

## Request body

- **file** (Required): The audio file object to transcribe. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `mistralai/Voxtral-Small-24B-2507` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **language** (Optional): The language of the input audio. Supplying the input language in ISO-639-1 format can improve accuracy and latency.
- **prompt** (Optional): An optional text prompt to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): Controls the sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model will adjust automatically to increase temperature as needed.
- **timestamp_granularities[]** (Optional): Specifies the timestamp granularity for transcription. Requires `response_format` to be set to `verbose_json`. Options: `word` - generates timestamps for individual words, `segment` - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
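The constraints in the list above (allowed `response_format` values, the 0 to 1 temperature range, and `timestamp_granularities` requiring `verbose_json`) can be checked before sending a request. This is a minimal sketch; `build_transcription_params` is a hypothetical helper, not part of the OpenAI SDK, and it only mirrors the rules stated in this section.

```python
ALLOWED_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
ALLOWED_GRANULARITIES = {"word", "segment"}


def build_transcription_params(
    model: str,
    *,
    language=None,
    prompt=None,
    response_format: str = "json",
    temperature=None,
    timestamp_granularities=None,
) -> dict:
    """Validate the optional fields and return only the ones that were set."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if temperature is not None and not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    if timestamp_granularities:
        unknown = set(timestamp_granularities) - ALLOWED_GRANULARITIES
        if unknown:
            raise ValueError(f"unknown timestamp granularities: {sorted(unknown)}")
        if response_format != "verbose_json":
            raise ValueError("timestamp_granularities requires response_format='verbose_json'")

    params = {"model": model, "response_format": response_format}
    optional = {
        "language": language,
        "prompt": prompt,
        "temperature": temperature,
        "timestamp_granularities": timestamp_granularities,
    }
    # Skip fields the caller left unset so the server-side defaults apply.
    params.update({k: v for k, v in optional.items() if v is not None})
    return params


params = build_transcription_params(
    "mistralai/Voxtral-Small-24B-2507",
    language="en",
    response_format="verbose_json",
    timestamp_granularities=["word"],
)
print(params)
```

The resulting dictionary can be unpacked into the SDK call, for example `client.audio.transcriptions.create(file=audio_file, **params)`.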

## Response body

The transcription object or a verbose transcription object.

### Example

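```python
from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Open the audio in binary mode; the context manager closes it after the request.
with open("speech.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="mistralai/Voxtral-Small-24B-2507",  # the model ID listed in the request body above
        file=audio_file,
    )

print(transcript.text)
```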
```json
{
 "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
```

# Create translation

For a given audio file and model, the endpoint will return the translated text to English.

## Request body

- **file** (Required): The audio file object to translate. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `mistralai/Voxtral-Small-24B-2507` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **prompt** (Optional): An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

## Response body

The translated text to English.

### Basic request

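```python
from openai import OpenAI

client = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Open the audio in binary mode; the context manager closes it after the request.
with open("speech.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="mistralai/Voxtral-Small-24B-2507",  # the model ID listed in the request body above
        file=audio_file,
    )

print(translation.text)
```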

```json
{
 "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
```

You can use OpenAI's JavaScript SDK to interact with our OpenAI Transcriptions and Translations compatible endpoint.

```bash
npm install openai
```

# Create transcription

For a given audio file and model, the endpoint will return the **transcription object** or a **verbose transcription object**.

## Request body

- **file** (Required): The audio file object to transcribe. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `mistralai/Voxtral-Small-24B-2507` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **language** (Optional): The language of the input audio. Supplying the input language in ISO-639-1 format can improve accuracy and latency.
- **prompt** (Optional): An optional text prompt to guide the model's style or continue a previous audio segment. The prompt should match the audio language.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): Controls the sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 make it more focused and deterministic. If set to 0, the model will adjust automatically to increase temperature as needed.
- **timestamp_granularities[]** (Optional): Specifies the timestamp granularity for transcription. Requires `response_format` to be set to `verbose_json`. Options: `word` - generates timestamps for individual words, `segment` - generates timestamps for segments. Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.

## Response body

The transcription object or a verbose transcription object.

### Example

```js
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://api.deepinfra.com/v1/openai",
  apiKey: "$DEEPINFRA_TOKEN",
});

async function main() {
  // Call the transcriptions endpoint (not translations) to keep the source language.
  const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream("speech.mp3"),
    model: "mistralai/Voxtral-Small-24B-2507",
  });

  console.log(transcription.text);
}
main();
```
```json
{
 "text": "Imagine the wildest idea that you've ever had, and you're curious about how it might scale to something that's a 100, a 1,000 times bigger. This is a place where you can get to do that."
}
```

# Create translation

For a given audio file and model, the endpoint will return the translated text to English.

## Request body

- **file** (Required): The audio file object to translate. Supported formats are `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, and `webm`.
- **model** (Required): ID of the model to use. Only `mistralai/Voxtral-Small-24B-2507` for this case. For other models, refer to [models/automatic-speech-recognition](models/automatic-speech-recognition).
- **prompt** (Optional): An optional text to guide the model's style or continue a previous audio segment. The prompt should be in English.
- **response_format** (Optional): The format of the output. Options include: `json` (default), `text`, `srt`, `verbose_json`, `vtt`.
- **temperature** (Optional): The sampling temperature, between 0 and 1. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. If set to 0, the model will use log probability to automatically increase the temperature until certain thresholds are hit.

## Response body

The translated text to English.

### Basic request

```js
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI({
 baseURL: 'https://api.deepinfra.com/v1/openai',
 apiKey: "$DEEPINFRA_TOKEN",
});

async function main() {
 const translation = await openai.audio.translations.create({
 file: fs.createReadStream("speech.mp3"),
 model: "mistralai/Voxtral-Small-24B-2507",
 });

 console.log(translation.text);
}
main();
```

```json
{
 "text": "Hello, my name is Wolfgang and I come from Germany. Where are you heading today?"
}
```

Input fields for the inference endpoint:

- **service_tier**: The service tier used for processing the request. `priority` processes the request with higher priority (premium rate); `flex` processes it at lower priority for a discount, is served only when spare capacity exists, and may be retried or timed out under load. Both apply only to models that support the respective tier.
- **audio**
- **task**
- **initial_prompt**: Optional text to provide as a prompt for the first window.
- **temperature**
- **language**: Language that the audio is in; uses the detected language if `None`. Use a two-letter ISO 639-1 language code (e.g. `en`, `de`, `ja`).
- **chunk_level**
- **chunk_length_s**
- **webhook**: The webhook to call when inference is done. By default, you get the output in the response to your inference request.


Output fields, grouped by schema:

- **InferenceReplyStatus**: Cost (estimated cost billed for the request in USD), Output Length, Runtime Ms, Status, Tokens Generated, Tokens Input.
- **Segment**: Avg Logprob, Compression Ratio, Confidence (confidence of the segment; only in the whisper-timestamped model), End (end location in input, in seconds from start), No Speech Prob, Seek, Start (start location in input, in seconds from start), Text, Tokens.
- **Word**
- **AutomaticSpeechRecognitionOut**: Duration, Inference Status (object containing the status of the inference request), Input Length Ms, Request Id, Segments, Words (a list of timestamped words in a segment; only in the whisper-timestamped model).

