The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

GLM-4.5

GLM-4.5-Air

Qwen3-Coder-480B-A35B-Instruct is Qwen3's most agentic code model, with strong performance on agentic coding, agentic browser use, and other foundational coding tasks, achieving results comparable to Claude Sonnet.

Qwen3-Coder-480B-A35B-Instruct-Turbo

Qwen3-Coder-480B-A35B-Instruct

Qwen3-235B-A22B-Thinking-2507 is a new Qwen3 model that scales up the thinking capability of Qwen3-235B-A22B, improving both the quality and depth of reasoning.

Qwen3-235B-A22B-Thinking-2507

Kimi K2 is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. Kimi K2 excels across a broad range of benchmarks, particularly in coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) tasks.

Kimi-K2-Instruct

Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

Voxtral-Small-24B-2507

Voxtral Mini is an enhancement of Ministral 3B, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

Voxtral-Mini-3B-2507

The DeepSeek R1 0528 Turbo model is a state-of-the-art reasoning model that can generate very quick responses.

DeepSeek-R1-0528-Turbo

Qwen3-235B-A22B-Instruct-2507 is the updated version of the Qwen3-235B-A22B non-thinking mode, featuring significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage.

Qwen3-235B-A22B-Instruct-2507

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

Qwen3-30B-A3B

Qwen3-32B

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

Qwen3-14B

DeepSeek-V3-0324-Turbo

The Llama 4 models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Maverick is a model with 17 billion active parameters and 128 experts.

Llama-4-Maverick-17B-128E-Instruct-Turbo

Llama-4-Maverick-17B-128E-Instruct-FP8

The Llama 4 models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Scout is a model with 17 billion active parameters and 16 experts.

Llama-4-Scout-17B-16E-Instruct

The DeepSeek R1 model has undergone a minor version upgrade, with the current version being DeepSeek-R1-0528.

DeepSeek-R1-0528

DeepSeek-V3-0324 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. It is an improved iteration of DeepSeek-V3.

DeepSeek-V3-0324

Devstral is an agentic LLM for software engineering tasks, making it a great choice for software engineering agents.

Devstral-Small-2507

Mistral-Small-3.2-24B-Instruct is a drop-in upgrade over the 3.1 release, with markedly better instruction following, roughly half the infinite-generation errors, and a more robust function-calling interface—while otherwise matching or slightly improving on all previous text and vision benchmarks.

Mistral-Small-3.2-24B-Instruct-2506

Llama Guard 4 is a natively multimodal safety classifier with 12 billion parameters trained jointly on text and multiple images. Llama Guard 4 is a dense architecture pruned from the Llama 4 Scout pre-trained model and fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It itself acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

Llama-Guard-4-12B

QwQ is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ, which is capable of thinking and reasoning, can achieve significantly enhanced performance in downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.

QwQ-32B

Anthropic’s most powerful model yet and the state-of-the-art coding model. It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, significantly expanding what AI agents can solve. Claude Opus 4 is ideal for powering frontier agent products and features.

claude-4-opus

Anthropic's mid-size model with superior intelligence for high-volume uses in coding, in-depth research, agents, & more.

claude-4-sonnet

Gemini 2.5 Flash is Google's latest thinking model, designed to tackle increasingly complex problems. It can reason through its thoughts before responding, resulting in enhanced performance and improved accuracy. Gemini 2.5 Flash is best for balancing reasoning and speed.

gemini-2.5-flash

Gemini 2.5 Pro is Google's most advanced thinking model, designed to tackle increasingly complex problems. Gemini 2.5 Pro leads common benchmarks by meaningful margins and showcases strong reasoning and code capabilities. Gemini 2.5 models are thinking models, capable of reasoning through their thoughts before responding, resulting in enhanced performance and improved accuracy. The Gemini 2.5 Pro model is now available on DeepInfra.

gemini-2.5-pro

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open source model and the successor to Gemma 2.

gemma-3-27b-it

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 12B is Google's latest open source model and the successor to Gemma 2.

gemma-3-12b-it

gemma-3-4b-it

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

Kokoro-82M

Orpheus TTS is a state-of-the-art, Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. This model has been fine-tuned to deliver human-level speech synthesis, achieving exceptional clarity, expressiveness, and real-time streaming performance.

orpheus-3b-0.1-ft

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.

csm-1b

DeepSeek-R1-Distill-Llama-70B is a highly efficient language model that leverages knowledge distillation to achieve state-of-the-art performance. This model distills the reasoning patterns of larger models into a smaller, more agile architecture, resulting in exceptional results on benchmarks like AIME 2024, MATH-500, and LiveCodeBench. With 70 billion parameters, DeepSeek-R1-Distill-Llama-70B offers a unique balance of accuracy and efficiency, making it an ideal choice for a wide range of natural language processing tasks. 

DeepSeek-R1-Distill-Llama-70B

DeepSeek-V3 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2.

DeepSeek-V3

Llama 3.3-70B Turbo is a highly optimized version of the Llama 3.3-70B model, utilizing FP8 quantization to deliver significantly faster inference speeds with a minor trade-off in accuracy. The model is designed to be helpful, safe, and flexible, with a focus on responsible deployment and mitigating potential risks such as bias, toxicity, and misinformation. It achieves state-of-the-art performance on various benchmarks, including conversational tasks, language translation, and text generation.

Llama-3.3-70B-Instruct-Turbo

Llama 3.3-70B is a multilingual LLM trained on a massive dataset of 15 trillion tokens, fine-tuned for instruction-following and conversational dialogue. The model is designed to be helpful, safe, and flexible, with a focus on responsible deployment and mitigating potential risks such as bias, toxicity, and misinformation. It achieves state-of-the-art performance on various benchmarks, including conversational tasks, language translation, and text generation.

Llama-3.3-70B-Instruct

Phi-4 is a model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.

phi-4

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3. In other words, it is the same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is much faster, at the expense of a minor quality degradation.

whisper-large-v3-turbo

At 2.5 billion parameters, with improved MMDiT-X architecture and training methods, this model is designed to run “out of the box” on consumer hardware, striking a balance between quality and ease of customization. It can generate images with resolutions between 0.25 and 2 megapixels.
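To make the resolution range concrete, the sketch below converts pixel dimensions to megapixels and checks them against the 0.25 to 2 megapixel range stated above. It assumes one megapixel means 1,000,000 pixels; the helper names are hypothetical and not part of any DeepInfra API.

```python
MIN_MEGAPIXELS = 0.25  # lower bound quoted in the model description
MAX_MEGAPIXELS = 2.0   # upper bound quoted in the model description


def megapixels(width: int, height: int) -> float:
    """Return the pixel count in megapixels (assuming 1 MP = 1,000,000 pixels)."""
    return width * height / 1_000_000


def in_supported_range(width: int, height: int) -> bool:
    """Check whether the given dimensions fall inside the stated range."""
    return MIN_MEGAPIXELS <= megapixels(width, height) <= MAX_MEGAPIXELS


if __name__ == "__main__":
    for size in [(512, 512), (1024, 1024), (2048, 2048)]:
        print(size, round(megapixels(*size), 3), in_supported_range(*size))
```

Under that assumption, a 1024x1024 image is about 1.05 megapixels and falls inside the range, while 2048x2048 (about 4.19 megapixels) does not.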

We recommend using our Node.js client: https://github.com/deepinfra/deepinfra-node.

You can install it with:

```bash
npm install deepinfra
```

and then use it as follows:

```javascript
import { TextToImage } from "deepinfra";
import { createWriteStream } from "fs";
import { Readable } from "stream";

const DEEPINFRA_API_KEY = process.env.DEEPINFRA_TOKEN; // read the API token from the environment
const MODEL = "stabilityai/sd3.5-medium";

const main = async () => {
  const model = new TextToImage(MODEL, DEEPINFRA_API_KEY);
  const response = await model.generate({
    prompt: "a burger with a funny hat on the beach",
  });

  const result = await fetch(response.images[0]);

  if (result.ok && result.body) {
    let writer = createWriteStream("image.png");
    Readable.fromWeb(result.body).pipe(writer);
  }
};

main();
```

Open `image.png` to see the result.



This document provides an overview of DeepInfra's OpenAI-compatible image generation API, which lets users generate images from text prompts using DeepInfra models.



## Image Generation

```bash
curl https://api.deepinfra.com/v1/openai/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "prompt": "A photo of an astronaut riding a horse on Mars.",
    "size": "1024x1024",
    "model": "stabilityai/sd3.5-medium",
    "n": 1
    }'
```



The API returns a JSON object containing the generated image(s).

## Example Response
```json
{
  "created": 1707000000,
  "data": [
    {
      "revised_prompt": "A photo of an astronaut riding a horse on Mars.",
      "b64_json": "/9j/4AAQS..."
    }
  ]
}
```
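As a rough sketch of how such a response could be handled, the following decodes every `b64_json` entry and writes it to disk. The sample payload above begins with `/9j/`, which is how base64-encoded JPEG data typically starts, so the sketch picks a file extension from the leading bytes rather than assuming PNG. The function name and the file-naming scheme are hypothetical.

```python
import base64
from pathlib import Path

# Leading "magic" bytes of the two formats this sketch recognises.
SIGNATURES = {b"\xff\xd8\xff": ".jpg", b"\x89PNG\r\n\x1a\n": ".png"}


def save_generated_images(response: dict, out_dir: str, stem: str = "image") -> list:
    """Decode each `b64_json` entry in `response["data"]` and write it to `out_dir`."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, item in enumerate(response.get("data", [])):
        raw = base64.b64decode(item["b64_json"])
        # Fall back to a neutral extension when the format is not recognised.
        suffix = next(
            (ext for sig, ext in SIGNATURES.items() if raw.startswith(sig)), ".bin"
        )
        path = target / f"{stem}_{index}{suffix}"
        path.write_bytes(raw)
        paths.append(path)
    return paths
```

Passing the parsed JSON body of a response and a directory name returns the list of files written, one per entry in `data`.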



You can use the official OpenAI Node.js client to generate images.


## Installation
```bash
npm install openai
```

## Code Example
```javascript
import OpenAI from "openai";
import fs from "fs";

const openai = new OpenAI({
  baseURL: "https://api.deepinfra.com/v1/openai",
  apiKey: process.env.DEEPINFRA_TOKEN, // read the API token from the environment
});

async function generateImage(prompt, outputPath) {
    const response = await openai.images.generate({
        prompt: prompt,
        model: "stabilityai/sd3.5-medium",
        n: 1,
        size: "1024x1024"
    });

    const base64Data = response.data[0].b64_json;
    const imageBuffer = Buffer.from(base64Data, "base64");

    fs.writeFileSync(outputPath, imageBuffer);
    console.log(`Image saved at: ${outputPath}`);
}

generateImage("A photo of an astronaut riding a horse on Mars.", "output.png");
```


You can use the official OpenAI Python client to generate images.

## Installation
```bash
pip install openai
```

## Code Example

```python
import base64
import os

import openai

def generate_image(prompt, output_path):
    client = openai.OpenAI(
        base_url="https://api.deepinfra.com/v1/openai",
        # Read the API token from the environment instead of hard-coding it.
        api_key=os.environ["DEEPINFRA_TOKEN"],
    )

    response = client.images.generate(
        prompt=prompt,
        model="stabilityai/sd3.5-medium",
        n=1,
        size="1024x1024",
    )

    base64_data = response.data[0].b64_json
    image_data = base64.b64decode(base64_data)

    with open(output_path, "wb") as file:
        file.write(image_data)

    print(f"Image saved at: {output_path}")

generate_image("A photo of an astronaut riding a horse on Mars.", "output.png")
```


You can use cURL or any other HTTP client to run inference:

```bash
curl -X POST \
    -d '{"prompt": "A photo of an astronaut riding a horse on Mars."}'  \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -H 'Content-Type: application/json'  \
    'https://api.deepinfra.com/v1/inference/stabilityai/sd3.5-medium'
```

which will give you back something similar to:

```json
{
  "images": [
    "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAIAAACQd1PeAAAADElEQVQI12PQz3wAAAJDAXkkWn+MAAAAAElFTkSuQmCC"
  ],
  "nsfw_content_detected": [
    false
  ],
  "seed": 42,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}

```
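Each entry in `images` is a base64 data URL, so it has to be decoded before it can be saved as a file. The sketch below shows one way to do that with the Python standard library; the function name is hypothetical, and it assumes the `data:<media type>;base64,<payload>` layout seen in the sample response.

```python
import base64


def decode_data_url(data_url: str) -> tuple:
    """Split a base64 data URL into (media_type, raw_bytes)."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    media_type = header[len("data:"):-len(";base64")]
    return media_type, base64.b64decode(payload)


if __name__ == "__main__":
    # The 1x1 placeholder image from the sample response above.
    sample = (
        "data:image/png;base64,"
        "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAIAAACQd1PeAAAADElEQVQI12PQz3wAAAJDAXkkWn+MAAAAAElFTkSuQmCC"
    )
    media_type, raw = decode_data_url(sample)
    with open("image.png", "wb") as handle:
        handle.write(raw)
    print(media_type, len(raw), "bytes")
```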


You can use our command-line tool [deepctl](/docs/advanced/deepctl) to run
inferences:

```bash
deepctl infer \
    -m 'stabilityai/sd3.5-medium'  \
    -i 'prompt=A photo of an astronaut riding a horse on Mars.'
```

which will give you back something similar to:

```json
{
  "images": [
    "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAIAAACQd1PeAAAADElEQVQI12PQz3wAAAJDAXkkWn+MAAAAAElFTkSuQmCC"
  ],
  "nsfw_content_detected": [
    false
  ],
  "seed": 42,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}

```


## Input Schema

Native inference API (`SD3In`):

- `prompt`
- `negative_prompt`
- `num_images`
- `num_inference_steps`
- `aspect_ratio`
- `guidance_scale`: classifier-free guidance; higher values follow the prompt more closely.
- `seed`
- `webhook`: the webhook to call when inference is done. By default, you get the output in the response to your inference request.

OpenAI-compatible HTTP API:

- `model` (string)
- `n` (integer)
- `response_format` (string): the format in which the generated images are returned. Currently only `b64_json` is supported.
- `size` (string): the size of the generated images. Available sizes depend on the model.
- `user` (string): a unique identifier representing your end-user, which can help to monitor and detect abuse.
- `prompt` (string)
- `quality` (string)
- `style` (string)

## Output Schema

`TextToImageOut`:

- `images`: a list of images, encoded in data-URL (PNG) format.
- `nsfw_content_detected`: a list of booleans indicating whether NSFW content was detected in the corresponding image.
- `request_id`
- `inference_status`: object containing the status of the inference request (`InferenceReplyStatus`).

`InferenceReplyStatus`:

- `cost`: estimated cost billed for the request in USD.
- `runtime_ms`
- `status`
- `tokens_generated`
- `tokens_input`
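The native input fields listed above can be combined into a single request body. The following is a minimal sketch that builds, but does not send, such a request with the Python standard library. The endpoint URL and the `bearer` header are taken from the cURL example; the function name and the option values are illustrative assumptions, since accepted values and defaults are not documented here.

```python
import json
import os
import urllib.request

INFERENCE_URL = "https://api.deepinfra.com/v1/inference/stabilityai/sd3.5-medium"


def build_inference_request(prompt: str, token: str, **options) -> urllib.request.Request:
    """Build (but do not send) a POST request for the native inference endpoint."""
    # Keep only options that were explicitly set; names mirror the input fields above.
    body = {"prompt": prompt}
    body.update({key: value for key, value in options.items() if value is not None})
    return urllib.request.Request(
        INFERENCE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_inference_request(
        "A photo of an astronaut riding a horse on Mars.",
        os.environ.get("DEEPINFRA_TOKEN", ""),
        negative_prompt="blurry",  # illustrative value
        num_images=1,              # illustrative value
        seed=42,                   # illustrative value
    )
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Sending the request is then a matter of passing it to `urllib.request.urlopen` and parsing the JSON body of the reply.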
