GLM-5.2 is Z-AI's latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a **solid 1M-token context**.

GLM-5.2

Kimi K2.7 Code is a coding-focused agentic model built upon Kimi K2.6. With substantial improvements on real-world long-horizon coding tasks, it strengthens end-to-end task completion across complex software engineering workflows while improving token efficiency, reducing thinking-token usage by approximately 30% compared with Kimi K2.6.

Kimi-K2.7-Code

Nemotron 3 Ultra is built for frontier reasoning, orchestration, coding agents, deep research, and complex enterprise workflows. It delivers up to 5x faster inference and up to 30% lower cost for agentic workloads while supporting up to 1M tokens of context.

NVIDIA-Nemotron-3-Ultra-550B-A55B

DeepSeek V4 Flash is an efficiency-focused MoE model with 284B total parameters (13B active) and a 1M-token context window. It's tuned for fast inference and high-throughput use cases while still holding up on reasoning and coding tasks.

DeepSeek-V4-Flash

DeepSeek V4 Pro is an MoE model with 1.6T total parameters (49B active) and a 1M-token context window. It's built for advanced reasoning, coding, and long-running agent tasks, and performs well on knowledge, math, and software engineering benchmarks.

DeepSeek-V4-Pro

Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.

Kimi-K2.6

MiMo-V2.5 is a native omnimodal model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified architecture. Built upon the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders, it delivers robust performance across multimodal perception, long-context reasoning, and agentic workflows. 

MiMo-V2.5

MiMo-V2.5-Pro is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters. It utilizes the hybrid attention architecture and 3-layers Multi-Token Prediction (MTP) introduced in [MiMo-V2-Flash](https://github.com/XiaomiMiMo/MiMo-V2-Flash).

MiMo-V2.5-Pro

Qwen3.6-35B-A3B is Alibaba's latest flagship Mixture-of-Experts model, with 35B total parameters and only 3B activated per token (256 experts, 8 routed + 1 shared). Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

Qwen3.6-35B-A3B

GLM-5.1 is Z-AI's next-generation flagship model for agentic engineering, with significantly stronger coding capabilities than its predecessor. It achieves state-of-the-art performance on SWE-Bench Pro and leads GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).

GLM-5.1

Qwen3.5-397B-A17B is Alibaba's most capable Qwen3.5 model, a Mixture-of-Experts architecture with 397B total parameters and 17B activated per token. It features a 262K token context window (extensible to 1M with YaRN), thinking/reasoning mode, tool calling with MCP integration, and support for 201 languages. It sets state-of-the-art results on reasoning, coding, math, and multimodal benchmarks.

Qwen3.5-397B-A17B

An efficient MoE variant of Gemma 4. Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

gemma-4-26B-A4B-it

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

gemma-4-31B-it

NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for highest compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.

NVIDIA-Nemotron-3-Super-120B-A12B

GLM-5 is an advanced, open-source large language model designed for developers tackling the toughest challenges. It excels at long-context reasoning, multi-step tool orchestration, and complex systems engineering, making it the ideal choice for powering sophisticated agents and applications that require high-level cognitive tasks.

GLM-5

Qwen3-TTS is an advanced text-to-speech model by Alibaba's Qwen team, delivering stable, expressive, and low-latency speech generation across 10 languages.

Key capabilities:

- 9 preset voices (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee) covering diverse genders, ages, and accents
- Voice cloning: clone any voice from a short (~3s) audio sample via the voice_id parameter
- Instruction control: adjust tone, emotion, and speaking style with natural language (e.g. "speak slowly and calmly", "excited tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming with ~97ms first-byte latency
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on a 1.7B parameter architecture using discrete multi-codebook language modeling for end-to-end speech synthesis without cascading errors. Uses a custom 12Hz acoustic tokenizer that preserves paralinguistic information and environmental audio details.

Qwen3-TTS

Qwen3-TTS-VoiceDesign is a voice design variant of Qwen3-TTS by Alibaba's Qwen team. Instead of selecting from preset voices, you describe the voice you want in natural language, and the model generates speech in that voice.

Key capabilities:

- Natural language voice control: describe any voice with free text (e.g. "a deep male voice with a calm, authoritative presence", "a young cheerful female with a warm and friendly tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on the same 1.7B parameter architecture as Qwen3-TTS, using discrete multi-codebook language modeling and a custom 12Hz acoustic tokenizer for high-quality end-to-end speech synthesis.

Qwen3-TTS-VoiceDesign

The latest flagship model in the Qwen family. State-of-the-art results across a comprehensive suite of benchmarks — including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding.

Qwen3-Max

The latest flagship reasoning model in the Qwen3 family, further enhanced by innovations such as adaptive tool use and advanced test-time scaling techniques.

Qwen3-Max-Thinking

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.

Kimi-K2.5

GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.

GLM-4.7-Flash

DeepSeek-V3.2 is a large language model designed to harmonize high computational efficiency with strong reasoning and agentic tool-use performance. It introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that reduces training and inference cost while preserving quality in long-context scenarios. A scalable reinforcement learning post-training framework further improves reasoning, with reported performance in the GPT-5 class, and the model has demonstrated gold-medal results on the 2025 IMO and IOI. V3.2 also uses a large-scale agentic task synthesis pipeline to better integrate reasoning into tool-use settings, boosting compliance and generalization in interactive environments.

DeepSeek-V3.2

The fastest model of the Flux 2 family. Frontier visual intelligence — state-of-the-art image generation and editing from Black Forest Labs

FLUX-2-klein-4b

The production-oriented model of the Flux 2 family, offering the best quality-to-latency ratio. Frontier visual intelligence: state-of-the-art image generation and editing from Black Forest Labs.

FLUX-2-klein-9b

Llama Guard 4 is a natively multimodal safety classifier with 12 billion parameters trained jointly on text and multiple images. Llama Guard 4 is a dense architecture pruned from the Llama 4 Scout pre-trained model and fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It itself acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.
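Since the classifier answers in plain text, client code usually needs to turn that text into a structured verdict. The following is a minimal sketch, assuming the reply has the form `safe`, or `unsafe` followed by a line of comma-separated category codes; that layout, the `GuardVerdict` type, and the `parse_guard_output` helper are illustrative assumptions, not something this page specifies.

```python
from dataclasses import dataclass, field


@dataclass
class GuardVerdict:
    """Structured form of a safety classifier's text reply."""
    safe: bool
    categories: list[str] = field(default_factory=list)


def parse_guard_output(text: str) -> GuardVerdict:
    """Parse a reply of the assumed form 'safe' or 'unsafe' plus a category line."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty classifier output")
    label = lines[0].lower()
    if label == "safe":
        return GuardVerdict(safe=True)
    if label == "unsafe":
        # Assumption: violated categories follow on the next line, comma-separated.
        raw = lines[1].split(",") if len(lines) > 1 else []
        return GuardVerdict(safe=False, categories=[c.strip() for c in raw if c.strip()])
    raise ValueError(f"unrecognized verdict: {lines[0]!r}")


verdict = parse_guard_output("unsafe\nS1, S9")
print(verdict.safe, verdict.categories)
```

Raising on an unrecognized first line, rather than defaulting to safe, keeps a malformed reply from being treated as a pass.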

You can POST to our OpenAI Chat Completions compatible endpoint.

Passing a URL to an image is the easiest way to perform OCR.

```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "meta-llama/Llama-Guard-4-12B",
      "max_tokens": 4092,
      "messages": [
        {
          "role": "user",
          "content": [
            {
              "type": "image_url",
              "image_url": {
                "url": "https://url.com/to/shakespeare.png"
              }
            }
          ]
        }
      ]
    }'
```

Another option is to read the image from a file.

```bash
# Encode the image as single-line base64 (-w 0 disables line wrapping in GNU base64)
BASE64_IMAGE=$(base64 -w 0 shakespeare.png)

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d @- <<EOF
{
  "model": "meta-llama/Llama-Guard-4-12B",
  "max_tokens": 4092,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/png;base64,$BASE64_IMAGE"
          }
        }
      ]
    }
  ]
}
EOF
```


You can use the official OpenAI Python client to run inference with us.

Passing a URL to an image is the easiest way to perform OCR with our OpenAI-compatible API.

```python
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-Guard-4-12B",
    max_tokens=4092,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://url.com/to/shakespeare.png"
                    }
                }
            ]
        }
    ],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer
# 11 25
```

Another option is to read the image from a file.

```python
import base64

from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Read and encode the local image file
with open("shakespeare.png", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-Guard-4-12B",
    max_tokens=4092,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}"
                    }
                }
            ]
        }
    ],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer
# 11 25
```


You can use JavaScript in the browser or in Node.js to make requests.

```bash
npm install openai
```

Passing a URL to an image is the easiest way to perform OCR with our OpenAI-compatible API.

```javascript
import OpenAI from "openai";

const openai = new OpenAI({
    baseURL: 'https://api.deepinfra.com/v1/openai',
    apiKey: "$DEEPINFRA_TOKEN",
});

async function main() {
  const completion = await openai.chat.completions.create({
    messages: [
      {
        role: "user", 
        content: [
          {
            type: "image_url",
            image_url: {
              "url": "https://url.com/to/shakespeare.png"
            }
          }
        ]
      }
    ],
    model: "meta-llama/Llama-Guard-4-12B",
    max_tokens: 4092,
  });

  console.log(completion.choices[0].message.content);
  console.log(completion.usage.prompt_tokens, completion.usage.completion_tokens);
}

main();

// To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer
// 11 25
```

Another option is to read the image from a file.

```javascript
import OpenAI from "openai";
import fs from "fs";

const openai = new OpenAI({
    baseURL: 'https://api.deepinfra.com/v1/openai',
    apiKey: "$DEEPINFRA_TOKEN",
});

async function main() {
  // Read and encode the local image file
  const imageBuffer = fs.readFileSync("shakespeare.png");
  const base64Image = imageBuffer.toString('base64');

  const completion = await openai.chat.completions.create({
    messages: [
      {
        role: "user", 
        content: [
          {
            type: "image_url",
            image_url: {
              "url": `data:image/png;base64,${base64Image}`
            }
          }
        ]
      }
    ],
    model: "meta-llama/Llama-Guard-4-12B",
    max_tokens: 4092,
  });

  console.log(completion.choices[0].message.content);
  console.log(completion.usage.prompt_tokens, completion.usage.completion_tokens);
}

main();

// To be, or not to be, that is the question: Whether 'tis nobler in the mind to suffer
// 11 25
```


The [AI SDK](https://sdk.vercel.ai/) by Vercel makes it very easy to do inference.

```bash
npm install ai @ai-sdk/deepinfra
```

Passing a URL to an image is the easiest way to perform OCR.

```javascript
import { createDeepInfra } from "@ai-sdk/deepinfra";
import { generateText } from "ai";

const deepinfra = createDeepInfra({
  apiKey: "$DEEPINFRA_TOKEN",
});

const { text, usage, finishReason } = await generateText({
  maxOutputTokens: 4092,
  model: deepinfra("meta-llama/Llama-Guard-4-12B"),
  messages: [
    {
      role: "user", 
      content: [
        {
          type: "image",
          image: "https://url.com/to/shakespeare.png"
        }
      ]
    }
  ],
});

console.log(text);
console.log(usage);
console.log(finishReason);
```

Another option is to read the image from a file.

```javascript
import { createDeepInfra } from "@ai-sdk/deepinfra";
import { generateText } from "ai";
import { readFileSync } from "fs";
import { join, dirname } from "path";
import { fileURLToPath } from "url";

const __dirname = dirname(fileURLToPath(import.meta.url));

const deepinfra = createDeepInfra({
  apiKey: "$DEEPINFRA_TOKEN",
});

const imageBuffer = readFileSync(join(__dirname, "shakespeare.png"));

const { text, usage, finishReason } = await generateText({
  maxOutputTokens: 4092,
  model: deepinfra("meta-llama/Llama-Guard-4-12B"),
  messages: [
    {
      role: "user", 
      content: [
        {
          type: "image",
          image: imageBuffer
        }
      ]
    }
  ],
});

console.log(text);
console.log(usage);
console.log(finishReason);
```



input

Maximum length of the newly generated text. If explicitly set to None, it will be the model's max context length minus input length or 65536, whichever is smaller.

max_new_tokens
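The fallback rule in that description is simple arithmetic, and a short sketch makes it concrete. The helper name `default_max_new_tokens` and the context lengths in the usage lines are hypothetical; only the 65536 cap and the "context minus input" rule come from the description.

```python
DEFAULT_CAP = 65536  # cap stated in the parameter description


def default_max_new_tokens(context_length: int, input_length: int, cap: int = DEFAULT_CAP) -> int:
    """Return the generation budget used when max_new_tokens is left as None."""
    remaining = context_length - input_length
    if remaining <= 0:
        raise ValueError("input already fills the context window")
    # Whichever is smaller: the remaining context or the fixed cap.
    return min(remaining, cap)


print(default_max_new_tokens(8192, 1000))     # 7192
print(default_max_new_tokens(131072, 1000))   # 65536
```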

Temperature to use for sampling. 0 means the output is deterministic. Values greater than 1 encourage more diversity.

temperature
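To see why low temperatures sharpen the distribution and high ones flatten it, here is a minimal sketch of temperature-scaled softmax over raw logits. It illustrates the general technique rather than this service's internals; treating a temperature of 0 as a greedy one-hot pick is an assumption based on the "deterministic" wording above.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed greedy behaviour: all mass on the highest logit.
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```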

Sample from the set of tokens with the highest probability such that the sum of their probabilities is higher than p. Lower values focus on the most probable tokens. Higher values sample more low-probability tokens.

top_p
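A small sketch of nucleus (top-p) filtering may help here. It keeps the most probable tokens until their cumulative mass reaches `p` and then renormalizes; the toy vocabulary is hypothetical, and using `>=` at the boundary is an implementation choice of this sketch rather than documented behaviour.

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest high-probability set whose mass reaches p, renormalized."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_p_filter(probs, 0.7))
```

With `p = 0.7`, only `a` and `b` survive, because together they already cover 0.8 of the probability mass.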

Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

min_p
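As an illustration of the relative threshold described above, the sketch below drops every token whose probability falls below `min_p` times the probability of the most likely token. The function name and toy distribution are hypothetical.

```python
def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Drop tokens below min_p * (top token probability), then renormalize."""
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0, 1]")
    threshold = min_p * max(probs.values())
    kept = {token: prob for token, prob in probs.items() if prob >= threshold}
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(sorted(min_p_filter(probs, 0.2)))  # ['a', 'b', 'c']
print(sorted(min_p_filter(probs, 0.0)))  # ['a', 'b', 'c', 'd']
```

With `min_p = 0.2` the cutoff is 0.1, so `d` is removed, while `min_p = 0` yields a cutoff of 0 and keeps everything, which matches "set to 0 to disable".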

Sample from the best k (number of) tokens. 0 means off

top_k

Repetition penalty. A value of 1 means no penalty, values greater than 1 discourage repetition, and values smaller than 1 encourage repetition.

repetition_penalty

Up to 16 strings that will terminate generation immediately

stop

Number of output sequences to return. Incompatible with streaming

num_responses

Optional nested object with "type" set to "json_object"

response_format
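For illustration, a request body that asks for JSON output could be assembled as follows. The field names `input`, `max_new_tokens`, and `response_format` are the ones listed on this page; the helper name and the prompt are hypothetical, and the sketch only builds the body without sending it.

```python
import json


def build_json_mode_body(prompt: str, max_new_tokens: int = 256) -> dict:
    """Build a request body whose response_format asks for a JSON object."""
    return {
        "input": prompt,
        "max_new_tokens": max_new_tokens,
        # Nested object with "type" set to "json_object", as described above.
        "response_format": {"type": "json_object"},
    }


body = build_json_mode_body("List three primary colors as a JSON object.")
print(json.dumps(body, indent=2))
```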

Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

presence_penalty

Positive values penalize new tokens based on how many times they appear in the text so far, increasing the model's likelihood to talk about new topics.

frequency_penalty
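The difference between the two penalties is easiest to see on a toy example. The sketch below uses the additive formulation that is common in OpenAI-style APIs: the presence penalty is subtracted once if a token has appeared at all, and the frequency penalty is subtracted once per occurrence. That this service applies exactly this formula is an assumption.

```python
from collections import Counter


def apply_penalties(
    logits: dict[str, float],
    generated: list[str],
    presence_penalty: float = 0.0,
    frequency_penalty: float = 0.0,
) -> dict[str, float]:
    """Lower the logits of tokens that already occur in the generated text."""
    counts = Counter(generated)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        # Frequency scales with the count; presence is a flat, one-time cost.
        penalty = frequency_penalty * count + (presence_penalty if count > 0 else 0.0)
        adjusted[token] = logit - penalty
    return adjusted


logits = {"the": 2.0, "cat": 1.0, "dog": 1.0}
print(apply_penalties(logits, ["the", "the", "cat"], presence_penalty=0.5, frequency_penalty=0.25))
# {'the': 1.0, 'cat': 0.25, 'dog': 1.0}
```

Here `the` appeared twice, so it pays the frequency penalty twice plus the presence penalty once, while the unseen `dog` is untouched.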

A unique identifier representing your end-user, which can help monitor and detect abuse. Avoid sending us any identifying information. We recommend hashing user identifiers.

user

Seed for random number generator. If not provided, a random seed is used. Determinism is not guaranteed.

seed

A key to identify prompt cache for reuse across requests. If provided, the prompt will be cached and can be reused in subsequent requests with the same key.

prompt_cache_key

The webhook to call when inference is done. By default, you will get the output in the response of your inference request.

webhook

Whether to stream tokens. Defaults to false. Currently only supported for Llama 2 text generation models; token-by-token updates are sent over SSE.

stream

Type

JsonObjectResponseFormat

Name

Schema

JsonSchema

JSON schema for structured output when type is 'json_schema'

JsonSchemaResponseFormat

Regex pattern for structured output when type is 'regex'

Regex

RegexResponseFormat

TextResponseFormat

Frequency Penalty

Input

Max New Tokens

Min P

Num Responses

Presence Penalty

Prompt Cache Key

Repetition Penalty

Response Format

Seed

Stop

Stream

Temperature

Top K

Top P

User

Webhook

TextGenerationIn

Generated Text

GeneratedText

estimated cost billed for the request in USD

Cost

Output Length

Runtime Ms

Status

Tokens Generated

Tokens Input

InferenceReplyStatus

Object containing the status of the inference request

Num Input Tokens

number of generated tokens, excluding prompt

Num Tokens

Request Id

Results

TextGenerationOut

The service tier used for processing the request. 'priority' processes the request with higher priority (premium rate); 'flex' processes it at lower priority for a discount, served only when spare capacity exists and may be retried/timed out under load. Both apply only to models that support the respective tier.

service_tier

model

conversation messages: (user,assistant,tool)*,user including one system message anywhere

messages

whether to stream the output via SSE or return the full response
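When streaming is enabled, the client has to reassemble the reply from individual server-sent events. The following sketch assumes the stream follows the OpenAI convention: each event is a `data:` line holding a JSON chunk with the text under `choices[0].delta.content`, and the stream ends with `data: [DONE]`. Those details and the sample lines are assumptions, not taken from this page.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each SSE data line."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(data)


# Hypothetical stream content used only for illustration.
sample = [
    'data: {"choices": [{"delta": {"content": "sa"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "fe"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(ev["choices"][0]["delta"].get("content", "") for ev in iter_sse_events(sample))
print(text)  # safe
```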

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

The maximum number of tokens to generate in the chat completion.

The total length of input tokens and generated tokens is limited by the model's context length. If explicitly set to None it will be the model's max context length minus input length or 65536, whichever is smaller.

max_tokens

up to 16 sequences where the API will stop generating further tokens

Up to 16 token IDs where the API will stop generating further tokens. Merged with the model's built-in stop tokens. Intended for private deployments.

stop_token_ids

A list of tools the model may call. Currently, only functions are supported as a tool.

tools

Controls which (if any) function is called by the model. none means the model will not call a function and instead generates a message. auto means the model can pick between generating a message or calling a function. required means the model must call a function. defined tool means the model must call that specific tool. none is the default when no functions are present. auto is the default if functions are present.

tool_choice

The format of the response. Currently, only json is supported.

Alternative penalty for repetition, but multiplicative instead of additive (> 1 penalize, < 1 encourage)

Whether to return log probabilities of the output tokens. If true, returns the log probabilities of each output token returned in the `content` of `message`.

logprobs

stream_options

Constrains effort on reasoning for reasoning models. Currently supported values are none, low, medium, high, and xhigh. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response. Setting it to none disables reasoning entirely if the model supports it.

reasoning_effort

reasoning

chat_template_kwargs

If set, the final assistant message is used as a prefix for the model to continue generating from, rather than starting a new turn. Only applicable when the last message in the conversation is an assistant message.

| Tier | Input | Output |
|------|-------|--------|
| Priority (1.5×) | $0.27 | $0.27 |
| Flex (0.8×) | $0.144 | $0.144 |
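The multipliers and listed prices can be cross-checked with a little arithmetic. The sketch below divides each listed price by its multiplier to recover the implied standard-tier price; that base figure is inferred, not stated on this page, and the pricing unit is left unspecified because the table does not give one.

```python
from decimal import Decimal

# Values copied from the table above.
TIER_MULTIPLIERS = {"priority": Decimal("1.5"), "flex": Decimal("0.8")}
LISTED_PRICES = {"priority": Decimal("0.27"), "flex": Decimal("0.144")}


def implied_base_price(tier: str) -> Decimal:
    """Recover the standard-tier price implied by a tier's listed price."""
    return LISTED_PRICES[tier] / TIER_MULTIPLIERS[tier]


for tier in TIER_MULTIPLIERS:
    print(tier, implied_base_price(tier))
```

Both tiers imply the same base price of $0.18, so the two rows are consistent with their multipliers.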