GLM-5.2 is Z-AI's latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a **solid 1M-token context**.

GLM-5.2

Kimi K2.7 Code is a coding-focused agentic model built upon Kimi K2.6. With substantial improvements on real-world long-horizon coding tasks, it strengthens end-to-end task completion across complex software engineering workflows while improving token efficiency, reducing thinking-token usage by approximately 30% compared with Kimi K2.6.

Kimi-K2.7-Code

Nemotron 3 Ultra is built for frontier reasoning, orchestration, coding agents, deep research, and complex enterprise workflows. It delivers up to 5x faster inference and up to 30% lower cost for agentic workloads while supporting a context of up to 1M tokens.

NVIDIA-Nemotron-3-Ultra-550B-A55B

DeepSeek V4 Flash is an efficiency-focused MoE model with 284B total parameters (13B active) and a 1M-token context window. It's tuned for fast inference and high-throughput use cases while still holding up on reasoning and coding tasks.

DeepSeek-V4-Flash

DeepSeek V4 Pro is an MoE model with 1.6T total parameters (49B active) and a 1M-token context window. It's built for advanced reasoning, coding, and long-running agent tasks, and performs well on knowledge, math, and software engineering benchmarks.

DeepSeek-V4-Pro

Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.

Kimi-K2.6

MiMo-V2.5-Pro is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters. It uses the hybrid attention architecture and 3-layer Multi-Token Prediction (MTP) introduced in [MiMo-V2-Flash](https://github.com/XiaomiMiMo/MiMo-V2-Flash).

MiMo-V2.5-Pro

Qwen3.6-35B-A3B is Alibaba's latest flagship Mixture-of-Experts model, with 35B total parameters and only 3B activated per token (256 experts, 8 routed + 1 shared). Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

Qwen3.6-35B-A3B

GLM-5.1 is Z-AI's next-generation flagship model for agentic engineering, with significantly stronger coding capabilities than its predecessor. It achieves state-of-the-art performance on SWE-Bench Pro and leads GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).

GLM-5.1

Qwen3.5-397B-A17B is Alibaba's most capable Qwen3.5 model, a Mixture-of-Experts architecture with 397B total parameters and 17B activated per token. It features a 262K-token context window (extensible to 1M with YaRN), a thinking/reasoning mode, tool calling with MCP integration, and support for 201 languages. It sets state-of-the-art results on reasoning, coding, math, and multimodal benchmarks.

Qwen3.5-397B-A17B

An efficient MoE variant of Gemma 4. Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

gemma-4-26B-A4B-it

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

gemma-4-31B-it

NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for highest compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.

NVIDIA-Nemotron-3-Super-120B-A12B

GLM-5 is an advanced, open-source large language model designed for developers tackling the toughest challenges. It excels at long-context reasoning, multi-step tool orchestration, and complex systems engineering, making it the ideal choice for powering sophisticated agents and applications that require high-level cognitive tasks.

GLM-5

Qwen3-TTS is an advanced text-to-speech model by Alibaba's Qwen team, delivering stable, expressive, and low-latency speech generation across 10 languages.

Key capabilities:

- 9 preset voices: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee, covering diverse genders, ages, and accents
- Voice cloning: clone any voice from a short (~3s) audio sample via the `voice_id` parameter
- Instruction control: adjust tone, emotion, and speaking style with natural language (e.g. "speak slowly and calmly", "excited tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming with ~97ms first-byte latency
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on a 1.7B parameter architecture using discrete multi-codebook language modeling for end-to-end speech synthesis without cascading errors. Uses a custom 12Hz acoustic tokenizer that preserves paralinguistic information and environmental audio details.

Qwen3-TTS

Qwen3-TTS-VoiceDesign is a voice design variant of Qwen3-TTS by Alibaba's Qwen team. Instead of selecting from preset voices, you describe the voice you want in natural language, and the model generates speech in that voice.

Key capabilities:

- Natural language voice control: describe any voice with free text (e.g. "a deep male voice with a calm, authoritative presence", "a young cheerful female with a warm and friendly tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on the same 1.7B parameter architecture as Qwen3-TTS, using discrete multi-codebook language modeling and a custom 12Hz acoustic tokenizer for high-quality end-to-end speech synthesis.

Qwen3-TTS-VoiceDesign

The latest flagship model in the Qwen family. It achieves state-of-the-art results across a comprehensive suite of benchmarks, including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding.

Qwen3-Max

The latest flagship reasoning model in the Qwen3 family. It is further enhanced by innovations such as adaptive tool use and advanced test-time scaling techniques.

Qwen3-Max-Thinking

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.

Kimi-K2.5

GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.

GLM-4.7-Flash

DeepSeek-V3.2 is a large language model designed to harmonize high computational efficiency with strong reasoning and agentic tool-use performance. It introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that reduces training and inference cost while preserving quality in long-context scenarios. A scalable reinforcement learning post-training framework further improves reasoning, with reported performance in the GPT-5 class, and the model has demonstrated gold-medal results on the 2025 IMO and IOI. V3.2 also uses a large-scale agentic task synthesis pipeline to better integrate reasoning into tool-use settings, boosting compliance and generalization in interactive environments.

DeepSeek-V3.2

The fastest model of the Flux 2 family, offering state-of-the-art image generation and editing from Black Forest Labs.

FLUX-2-klein-4b

The Flux 2 family model with the best quality-to-latency ratio, aimed at production apps. It offers state-of-the-art image generation and editing from Black Forest Labs.

FLUX-2-klein-9b

Realtime TTS 2.0 is a low-latency text-to-speech model with natural language steering, allowing you to control tone and emotion directly in the prompt (e.g., “[be happy and upbeat] Hello!”). It supports cross-lingual voices and multiple languages, enabling the same voice to speak consistently across different languages. This is an early access preview ahead of full launch, with ongoing improvements to voice quality and steering.
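
The steering syntax described above places an instruction in square brackets in front of the text to speak. A minimal sketch of a helper that builds such prompts follows. `with_steering` is a hypothetical name, not part of any DeepInfra SDK, and the bracket-prefix format is inferred only from the single example above.

```python
def with_steering(text: str, instruction: str = "") -> str:
    """Prefix `text` with a bracketed steering instruction, if one is given."""
    instruction = instruction.strip()
    if not instruction:
        return text
    return f"[{instruction}] {text}"


prompt = with_steering("Hello!", "be happy and upbeat")
print(prompt)
```

The resulting string would be passed as the `text` field of the request.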

You can use cURL or any other HTTP client to run inference:

```bash
curl -X POST \
    -d '{"text": "The quick brown fox jumps over the lazy dog"}'  \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN"  \
    -H 'Content-Type: application/json'  \
    'https://api.deepinfra.com/v1/inference/inworld-ai/realtime-tts-2'
```

which will give you back something similar to:

```json
{
  "audio": null,
  "input_character_length": 0,
  "output_format": "",
  "words": [
    {
      "end": 1.0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "start": 4.0,
      "text": "World"
    }
  ],
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0,
    "output_length": 0
  }
}

```


You can use our command-line tool [deepctl](/docs/advanced/deepctl) to run
inferences:

```bash
deepctl infer \
    -m 'inworld-ai/realtime-tts-2'  \
    -i 'text=The quick brown fox jumps over the lazy dog'
```

which will give you back something similar to:

```json
{
  "audio": null,
  "input_character_length": 0,
  "output_format": "",
  "words": [
    {
      "end": 1.0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "start": 4.0,
      "text": "World"
    }
  ],
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0,
    "output_length": 0
  }
}

```
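
The `words` array in the sample response pairs each word with `start` and `end` values. The sketch below reads them, assuming the values are seconds (the sample does not state the unit). `word_timeline` is a hypothetical helper name.

```python
import json

# Trimmed copy of the sample response shown above.
sample = json.loads("""
{
  "audio": null,
  "words": [
    {"end": 1.0, "start": 0.0, "text": "Hello"},
    {"end": 5.0, "start": 4.0, "text": "World"}
  ]
}
""")


def word_timeline(response: dict) -> list[tuple[str, float, float]]:
    """Return (text, start, duration) for each word, ordered by start time."""
    words = response.get("words") or []
    ordered = sorted(words, key=lambda w: w["start"])
    return [(w["text"], w["start"], w["end"] - w["start"]) for w in ordered]


timeline = word_timeline(sample)
for text, start, duration in timeline:
    print(f"{start:5.2f}s  +{duration:.2f}s  {text}")
```

A timeline like this could drive captions or word highlighting during playback.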


DeepInfra supports custom voices.

## Create voice

The following creates a voice using the `curl` command.

```bash
curl -X POST "https://api.deepinfra.com/v1/voices/add" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F "audio=@hello.wav" \
  -F "name=John Doe" \
  -F "description=John Doe's voice"
```

which will return something similar to

```json
{
  "user_id": "gh:10000000", 
  "voice_id": "abcd1234abcd1234abcd",
  "name": "John Doe",
  "description": "John Doe's voice",
  "created_at": 1723851387,
  "updated_at": 1723851387
}
```
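
The `created_at` and `updated_at` fields look like Unix timestamps in seconds. That is an assumption based on their magnitude, not something the response states. Under that assumption, a sketch of converting them to readable UTC times (`to_utc` is a hypothetical helper):

```python
from datetime import datetime, timezone

# Fields copied from the sample response above.
voice = {
    "voice_id": "abcd1234abcd1234abcd",
    "name": "John Doe",
    "created_at": 1723851387,
    "updated_at": 1723851387,
}


def to_utc(epoch_seconds: int) -> datetime:
    """Interpret an integer as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


created = to_utc(voice["created_at"])
print(created.isoformat())
```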


We aim to be compatible with the ElevenLabs Python library. Please reach out to feedback@deepinfra.com if you encounter any issues.

```python
from elevenlabs import ElevenLabs, play

# Initialize the ElevenLabs client with overridden api_key and base_url
client = ElevenLabs(api_key="$DEEPINFRA_TOKEN", base_url="https://api.deepinfra.com")

# Define the voice data
voice_name = "John Doe"
voice_description = "John Doe's voice"
audio_file_path = "test_audio.wav"

# Create the voice by cloning using the ElevenLabs client
cloned_voice = client.clone(
    name=voice_name,
    description=voice_description,
    files=[audio_file_path],
    labels="",
)

# Use the voice_id to generate speech
audio = client.generate(
    text="Hello, how are you?",
    voice=cloned_voice.voice_id,
    model="deepinfra/tts",
    output_format="wav",
)

play(audio)
```


The following creates a voice using the `axios` library in JavaScript.

```javascript
const axios = require('axios');
const FormData = require('form-data');
const fs = require('fs');

// Define the API endpoint
const url = "https://api.deepinfra.com/v1/voices/add";

// Create a FormData instance
const formData = new FormData();

// Append the audio file, name, and description to the form data
formData.append('files', fs.createReadStream('test_audio.wav'));
formData.append('name', 'John Doe');
formData.append('description', "John Doe's voice");

// Set the headers, including authorization and content type
const headers = {
    "Authorization": "Bearer $DEEPINFRA_TOKEN",
    ...formData.getHeaders()
};

// Send the POST request
axios.post(url, formData, { headers: headers })
    .then(response => {
        console.log("Voice created successfully!");
        console.log("Response:", response.data);
    })
    .catch(error => {
        console.error("Failed to create voice.");
        console.error("Status Code:", error.response.status);
        console.error("Response:", error.response.data);
    });
```

## Read voice

The following reads a voice using the `curl` command.

```bash
curl -X GET "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN"
```

which will return something similar to

```json
{
  "user_id": "gh:10000000", 
  "voice_id": "abcd1234abcd1234abcd",
  "name": "John Doe",
  "description": "John Doe's voice",
  "created_at": 1723851387,
  "updated_at": 1723851387
}
```


The following reads a voice using the `elevenlabs` library in Python. If you encounter any issues, please contact us at feedback@deepinfra.com.

```python
from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="$DEEPINFRA_TOKEN", base_url="https://api.deepinfra.com")

voice = client.voices.get(voice_id="abcd1234abcd1234abcd")
print(voice)
```


The following reads a voice using JavaScript with the `fetch` API.

```javascript
const url = "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd";
const headers = {
  "Content-Type": "application/json",
  "Authorization": "Bearer $DEEPINFRA_TOKEN"
};

fetch(url, { method: "GET", headers: headers })
  .then(response => response.json())
  .then(data => console.log(data))
  .catch(error => console.error('Error:', error));
```

## Update voice

The following updates a voice using the `curl` command.

```bash
curl -X POST "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd/edit" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F "name=Jane Doe" \
  -F "description=Jane Doe's voice"
```

which will return something similar to

```json
{
  "user_id": "gh:10000000", 
  "voice_id": "abcd1234abcd1234abcd",
  "name": "Jane Doe",
  "description": "Jane Doe's voice",
  "created_at": 1723851387,
  "updated_at": 1723851387
}
```


The following updates a voice using the `elevenlabs` library in Python. If you encounter any issues, please contact us at feedback@deepinfra.com.

```python
from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="$DEEPINFRA_TOKEN", base_url="https://api.deepinfra.com")

client.voices.edit(voice_id="abcd1234abcd1234abcd", name="Jane Doe", description="Jane Doe's voice")
```


Update a voice using the `axios` library in JavaScript.

```javascript
const axios = require('axios');
const FormData = require('form-data');

const url = "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd/edit";
const formData = new FormData();
formData.append('name', 'Jane Doe');
formData.append('description', "Jane Doe's voice");

const headers = {
    "Authorization": "Bearer $DEEPINFRA_TOKEN",
    ...formData.getHeaders()
};

axios.post(url, formData, { headers: headers })
    .then(response => {
        console.log("Voice updated successfully!");
        console.log("Response:", response.data);
    })
    .catch(error => {
        console.error("Failed to update voice.");
        console.error("Status Code:", error.response.status);
        console.error("Response:", error.response.data);
    });
```


## Delete voice

The following deletes a voice using the `curl` command.

```bash
curl -X DELETE "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN"
```

which returns a `200 OK` status code on success.


The following deletes a voice using the `elevenlabs` library in Python.

```python
from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="$DEEPINFRA_TOKEN", base_url="https://api.deepinfra.com")

client.voices.delete(voice_id="abcd1234abcd1234abcd")
```


The following deletes a voice using the `fetch` API in JavaScript.

```javascript
const url = "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd";
const headers = {
  "Content-Type": "application/json",
  "Authorization": `Bearer $DEEPINFRA_TOKEN`
};

fetch(url, {
  method: "DELETE",
  headers: headers
})
.then(response => {
  if (response.ok) {
    console.log("Voice deleted successfully.");
  } else {
    throw new Error(`Failed to delete voice: ${response.status}`);
  }
})
.catch(error => console.error(error));
```


## List voices

The following lists voices using the `curl` command.

```bash
curl -X GET "https://api.deepinfra.com/v1/voices" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN"
```

which will return something similar to

```json
{
  "voices": [
    {
      "user_id": "gh:10000000", 
      "voice_id": "abcd1234abcd1234abcd",
      "name": "John Doe",
      "description": "John Doe's voice",
      "created_at": 1723851387,
      "updated_at": 1723851387
    },
    {
      "user_id": "gh:10000000",
      "voice_id": "abcd1234abcd1234abc1",
      "name": "Jane Doe",
      "description": "Jane Doe's voice",
      "created_at": 1723680057,
      "updated_at": 1723680057
    }
  ]
}
```
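
When you know a voice's name but not its ID, the list response can be searched on the client side. The following is a minimal sketch over the sample response above. `find_voice_id` is a hypothetical helper, and the case-insensitive exact match is a design choice here rather than API behavior.

```python
import json

# Trimmed copy of the sample list response above.
listing = json.loads("""
{
  "voices": [
    {"voice_id": "abcd1234abcd1234abcd", "name": "John Doe"},
    {"voice_id": "abcd1234abcd1234abc1", "name": "Jane Doe"}
  ]
}
""")


def find_voice_id(response: dict, name: str):
    """Return the voice_id of the first voice whose name matches, else None."""
    wanted = name.casefold()
    for voice in response.get("voices", []):
        if voice.get("name", "").casefold() == wanted:
            return voice["voice_id"]
    return None


print(find_voice_id(listing, "jane doe"))
```

The returned ID can then be used in the read, update, and delete requests shown earlier.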


The following lists voices using the `elevenlabs` client library in Python. If you encounter any issues, please reach out to feedback@deepinfra.com.

```python
from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="$DEEPINFRA_TOKEN", base_url="https://api.deepinfra.com")

response = client.voices.get_all()

for voice in response.voices:
    print(voice.voice_id)
```


The following lists voices using the `fetch` API in JavaScript.

```javascript
const url = "https://api.deepinfra.com/v1/voices";
const headers = {
  "Content-Type": "application/json",
  "Authorization": `Bearer $DEEPINFRA_TOKEN`
};

fetch(url, {
  method: "GET",
  headers: headers
})
.then(response => {
  if (response.ok) {
    return response.json();
  } else {
    throw new Error(`Failed to list voices: ${response.status}`);
  }
})
.then(data => console.log(data))
.catch(error => console.error(error));
```


The DeepInfra OpenAI-compatible Speech API endpoint converts text into speech audio. This document shows how to call the endpoint and choose among its audio output formats.

## Create speech

Use the following Python example to generate an audio file from your text input:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai",
                api_key="$DEEPINFRA_TOKEN")

speech_file_path = Path(__file__).parent / "speech.mp3"
with client.audio.speech.with_streaming_response.create(
  model="inworld-ai/realtime-tts-2",
  voice="Ashley",
  input="The quick brown fox jumped over the lazy dog.",
  response_format="mp3",
) as response:
  response.stream_to_file(speech_file_path)
```

The API returns the generated audio file in the requested format (e.g., `mp3`, `pcm`). The example above saves the audio output directly as `speech.mp3`.

### Supported Audio Formats

| Format | Description                           | Recommended Usage                          |
|--------|---------------------------------------|--------------------------------------------|
| pcm    | Raw, uncompressed audio               | *Best for lowest latency streaming*        |
| mp3    | Compressed audio suitable for storage | General use and file distribution          |
| opus   | Compressed audio ideal for streaming  | Efficient streaming and real-time playback |
| flac   | Lossless audio compression            | High-quality archival storage              |
| wav    | Uncompressed audio                    | High-quality audio processing tasks        |

### Performance Recommendations

- **For lowest latency (fastest initial audio chunk):**
  - Use the `pcm` output format.

- **For general purposes:**
  - Use the `mp3` or `opus` formats for optimal balance between quality and file size.
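
Because `pcm` is raw audio with no header, most players cannot open it directly. One option is to wrap the samples in a WAV container with Python's standard `wave` module. The sketch below assumes 16-bit mono samples at 24 kHz. Those values are placeholders, not documented properties of the endpoint, so check the actual sample rate and sample width before relying on them.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM bytes in a WAV header (format values are assumptions)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample: 2 means 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# Half a second of silence stands in for audio returned by the API.
silence = b"\x00\x00" * 12000
wav_bytes = pcm_to_wav(silence)
print(len(wav_bytes) - len(silence), "header bytes added")
```

If you only need a playable file and latency is not a concern, requesting `wav` or `mp3` directly avoids this step.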



Use the following JavaScript example to generate an audio file from your text input:

```javascript
import fs from "fs";
import path from "path";
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://api.deepinfra.com/v1/openai",
  apiKey: "$DEEPINFRA_TOKEN",
});

const speechFile = path.resolve("./speech.mp3");

async function main() {
  const mp3 = await openai.audio.speech.create({
    model: "inworld-ai/realtime-tts-2",
    voice: "Ashley",
    input: "The quick brown fox jumped over the lazy dog.",
    response_format: "mp3",
  });
  console.log(speechFile);
  const buffer = Buffer.from(await mp3.arrayBuffer());
  await fs.promises.writeFile(speechFile, buffer);
}
main();
```

Use the following example `curl` request to generate an audio file from your text input:

```bash
curl https://api.deepinfra.com/v1/openai/audio/speech \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "inworld-ai/realtime-tts-2",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "Ashley",
    "response_format": "mp3"
  }' \
  --output speech.mp3
```


The DeepInfra ElevenLabs-compatible Speech API endpoint converts text input into speech audio. This document shows how to call the endpoint to generate audio, with and without streaming.

Use the following Python examples to generate audio from your text input:

## Create Speech (non-streaming)

```python
from elevenlabs import ElevenLabs

client = ElevenLabs(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/",
)
client.text_to_speech.convert(
    voice_id="Ashley",
    output_format="mp3",
    text="The quick brown fox jumped over the lazy dog.",
    model_id="inworld-ai/realtime-tts-2",
)
```

## Create Speech with Streaming

```python
from elevenlabs import ElevenLabs

client = ElevenLabs(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/",
)
client.text_to_speech.convert_as_stream(
    voice_id="Ashley",
    output_format="pcm",
    text="The quick brown fox jumped over the lazy dog.",
    model_id="inworld-ai/realtime-tts-2",
)
```

The API returns the generated audio in the requested format, such as `mp3` or `pcm`. The examples above request the audio but do not write it to disk; save the returned data yourself if you need a file.
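
To write the returned audio to a file from Python, the result of the client call has to be consumed. Assuming it is an iterable of byte chunks (this depends on the `elevenlabs` SDK version, so verify it against the version you have installed), a sketch of saving it could look like this. `save_chunks` is a hypothetical helper, and the fake iterator stands in for a real response.

```python
import os
import tempfile
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], path: str) -> int:
    """Write byte chunks to `path` in order and return the total bytes written."""
    total = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                out.write(chunk)
                total += len(chunk)
    return total


# Stand-in for the iterator returned by the client call.
fake_audio = iter([b"ID3", b"", b"\x00\x01\x02"])
target = os.path.join(tempfile.gettempdir(), "speech_demo.mp3")
written = save_chunks(fake_audio, target)
print(written, "bytes written to", target)
```

Writing chunk by chunk keeps memory use flat, which matters most for the streaming variant.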

### Supported Audio Formats

| Format | Description                           | Recommended Usage                          |
|--------|---------------------------------------|--------------------------------------------|
| pcm    | Raw, uncompressed audio               | *Lowest latency streaming scenarios*       |
| mp3    | Compressed audio suitable for storage | General use and easy file sharing          |
| opus   | Compressed audio ideal for streaming  | Real-time audio streaming applications     |
| flac   | Lossless audio compression            | Archival storage and high-fidelity needs   |
| wav    | Uncompressed audio                    | Audio processing and editing tasks         |

### Performance Recommendations

- **Lowest latency:** Using the `pcm` output format for streaming applications is **HIGHLY RECOMMENDED**.
- **General use:** Use `mp3` or `opus` formats for the best trade-off between quality and file size.



Use the following JavaScript examples to generate audio from your text input:

## Create Speech (non-streaming)

```javascript
import { ElevenLabsClient } from "elevenlabs";

const client = new ElevenLabsClient({ apiKey: "$DEEPINFRA_TOKEN", base_url: "https://api.deepinfra.com/" });
await client.textToSpeech.convert("Ashley", {
    output_format: "mp3",
    text: "The quick brown fox jumped over the lazy dog.",
    model_id: "inworld-ai/realtime-tts-2"
});
```

## Create Speech with Streaming

```javascript
import { ElevenLabsClient } from "elevenlabs";

const client = new ElevenLabsClient({ apiKey: "$DEEPINFRA_TOKEN", base_url: "https://api.deepinfra.com/" });
await client.textToSpeech.convertAsStream("Ashley", {
    output_format: "pcm",
    text: "The quick brown fox jumped over the lazy dog.",
    model_id: "inworld-ai/realtime-tts-2"
});
```

Use the following `curl` examples to generate an audio file from your text input:

## Create Speech (non-streaming)

```bash
curl -X POST "https://api.deepinfra.com/v1/text-to-speech/Ashley" \
     -H "xi-api-key: $DEEPINFRA_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
  "text": "The quick brown fox jumped over the lazy dog.",
  "model_id": "inworld-ai/realtime-tts-2",
  "output_format": "mp3"
}' --output speech.mp3
```

## Create Speech with Streaming

```bash
curl -X POST "https://api.deepinfra.com/v1/text-to-speech/Ashley/stream" \
     -H "xi-api-key: $DEEPINFRA_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
  "text": "The quick brown fox jumped over the lazy dog.",
  "model_id": "inworld-ai/realtime-tts-2",
  "output_format": "pcm"
}' --output speech.pcm
```


The service tier used for processing the request. 'priority' processes the request with higher priority at a premium rate. 'flex' processes it at lower priority for a discount; it is served only when spare capacity exists and may be retried or time out under load. Each tier applies only to models that support it.

service_tier

text

Preset voice name (Ashley, Diego, etc.) or a voice_id from /v1/voices/add for voice cloning.

voice

output_format

speaking_rate

Controls the variability of the speech.

temperature

sample_rate

return_timestamps

Language hint (e.g. "AUTO" or a specific code like "EN_US"). Supported by realtime-tts-2; ignored by 1.5 models.

language

stream

The webhook to call when inference is done. By default, the output is returned in the response to your inference request.

webhook

Select the desired voice for the speech output.

InworldTtsVoice

ServiceTier

Select the desired format for the speech output. Supported formats include mp3, opus, flac, wav, and pcm.

realtime-tts-2

Input

Output
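
The parameters listed above can be assembled into a JSON request body. The sketch below is an illustration under several assumptions: `build_tts_payload` is a hypothetical helper, only `text` is treated as required, and the format check uses the five formats named above. It builds the body and does not send anything.

```python
import json

SUPPORTED_FORMATS = {"mp3", "opus", "flac", "wav", "pcm"}


def build_tts_payload(text: str, **options) -> str:
    """Build a JSON body from `text` plus any optional parameters that are set."""
    if not text:
        raise ValueError("text must not be empty")
    fmt = options.get("output_format")
    if fmt is not None and fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output_format: {fmt!r}")
    # Drop options left as None so they are omitted from the request.
    body = {"text": text}
    body.update({k: v for k, v in options.items() if v is not None})
    return json.dumps(body)


payload = build_tts_payload(
    "The quick brown fox jumps over the lazy dog",
    voice="Ashley",
    output_format="pcm",
    language="AUTO",
    service_tier=None,
)
print(payload)
```

Validating the format locally surfaces a typo before a request is made; the server remains the authority on which values each model accepts.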