gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. The model supports configurable reasoning depth, full chain-of-thought access, and native tool use, including function calling, browsing, and structured output generation.

gpt-oss-120b

GLM-4.5V is based on ZhipuAI’s next-generation flagship text foundation model GLM-4.5-Air (106B parameters, 12B active). It continues the technical approach of GLM-4.1V-Thinking, achieving SOTA performance among models of the same scale on 42 public vision-language benchmarks. It covers common tasks such as image, video, and document understanding, as well as GUI agent operations.

GLM-4.5V

gpt-oss-20b is an open-weight 21B parameter model released by OpenAI under the Apache 2.0 license. It uses a Mixture-of-Experts (MoE) architecture with 3.6B active parameters per forward pass, optimized for lower-latency inference. The model is trained in OpenAI’s Harmony response format and supports reasoning level configuration, fine-tuning, and agentic capabilities including function calling, tool use, and structured outputs.

gpt-oss-20b

Qwen3-Coder-480B-A35B-Instruct is Qwen3's most agentic code model, with strong performance on agentic coding, agentic browser use, and other foundational coding tasks, achieving results comparable to Claude Sonnet.

Qwen3-Coder-480B-A35B-Instruct-Turbo

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

GLM-4.5

Kimi K2 is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. Kimi K2 excels across a broad range of benchmarks, particularly in coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) tasks.

Kimi-K2-Instruct

olmOCR is a specialized AI tool that converts PDF documents into clean, structured text while preserving important formatting and layout information. What makes olmOCR particularly valuable for developers is its ability to handle challenging PDFs that traditional OCR tools struggle with—including complex layouts, poor-quality scans, handwritten text, and documents with mixed content types. Built on a fine-tuned 7B vision-language model, olmOCR provides enterprise-grade PDF processing at a fraction of the cost of proprietary solutions.

olmOCR-7B-0725-FP8

Qwen3-235B-A22B-Thinking-2507 is a new Qwen3 model that scales up the thinking capability of Qwen3-235B-A22B, improving both the quality and depth of reasoning.

Qwen3-235B-A22B-Thinking-2507

Qwen3-Coder-480B-A35B-Instruct

GLM-4.5-Air

Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

Voxtral-Small-24B-2507

Voxtral Mini is an enhancement of Ministral 3B, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

Voxtral-Mini-3B-2507

The DeepSeek R1 0528 Turbo model is a state-of-the-art reasoning model that can generate very quick responses.

DeepSeek-R1-0528-Turbo

Qwen3-235B-A22B-Instruct-2507 is the updated version of the Qwen3-235B-A22B non-thinking mode, featuring significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage.

Qwen3-235B-A22B-Instruct-2507

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

Qwen3-30B-A3B

Qwen3-32B

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.

Qwen3-14B

DeepSeek-V3-0324-Turbo

The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Maverick is a 17 billion parameter model with 128 experts.

Llama-4-Maverick-17B-128E-Instruct-Turbo

Llama-4-Maverick-17B-128E-Instruct-FP8

The Llama 4 collection consists of natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Scout is a 17 billion parameter model with 16 experts.

Llama-4-Scout-17B-16E-Instruct

The DeepSeek R1 model has undergone a minor version upgrade, with the current version being DeepSeek-R1-0528.

DeepSeek-R1-0528

DeepSeek-V3-0324 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. It is an improved iteration of DeepSeek-V3.

DeepSeek-V3-0324

Devstral is an agentic LLM for software engineering tasks, making it a great choice for software engineering agents.

Devstral-Small-2507

Mistral-Small-3.2-24B-Instruct is a drop-in upgrade over the 3.1 release, with markedly better instruction following, roughly half the infinite-generation errors, and a more robust function-calling interface—while otherwise matching or slightly improving on all previous text and vision benchmarks.

Mistral-Small-3.2-24B-Instruct-2506

Llama Guard 4 is a natively multimodal safety classifier with 12 billion parameters trained jointly on text and multiple images. Llama Guard 4 is a dense architecture pruned from the Llama 4 Scout pre-trained model and fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It itself acts as an LLM: it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

Llama-Guard-4-12B

QwQ is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ, which is capable of thinking and reasoning, can achieve significantly enhanced performance in downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.

QwQ-32B

Anthropic’s most powerful model yet and the state-of-the-art coding model. It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, significantly expanding what AI agents can solve. Claude Opus 4 is ideal for powering frontier agent products and features.

claude-4-opus

Anthropic's mid-size model with superior intelligence for high-volume uses in coding, in-depth research, agents, & more.

claude-4-sonnet

Gemini 2.5 Flash is Google's latest thinking model, designed to tackle increasingly complex problems. It reasons through its thoughts before responding, resulting in enhanced performance and improved accuracy. Gemini 2.5 Flash is best for balancing reasoning and speed.

gemini-2.5-flash

Gemini 2.5 Pro is Google's most advanced thinking model, designed to tackle increasingly complex problems. Gemini 2.5 Pro leads common benchmarks by meaningful margins and showcases strong reasoning and code capabilities. Gemini 2.5 models are thinking models, capable of reasoning through their thoughts before responding, resulting in enhanced performance and improved accuracy. The Gemini 2.5 Pro model is now available on DeepInfra.

gemini-2.5-pro

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open source model and the successor to Gemma 2.

gemma-3-27b-it

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 12B is Google's latest open source model and the successor to Gemma 2.

gemma-3-12b-it

gemma-3-4b-it

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

Kokoro-82M

Orpheus TTS is a state-of-the-art, Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. This model has been fine-tuned to deliver human-level speech synthesis, achieving exceptional clarity, expressiveness, and real-time streaming performance.

orpheus-3b-0.1-ft

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.

csm-1b

DeepSeek-R1-Distill-Llama-70B is a highly efficient language model that leverages knowledge distillation to achieve state-of-the-art performance. This model distills the reasoning patterns of larger models into a smaller, more agile architecture, resulting in exceptional results on benchmarks like AIME 2024, MATH-500, and LiveCodeBench. With 70 billion parameters, DeepSeek-R1-Distill-Llama-70B offers a unique balance of accuracy and efficiency, making it an ideal choice for a wide range of natural language processing tasks. 

DeepSeek-R1-Distill-Llama-70B

DeepSeek-V3 is a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2.

DeepSeek-V3

Llama 3.3-70B Turbo is a highly optimized version of the Llama 3.3-70B model, utilizing FP8 quantization to deliver significantly faster inference speeds with a minor trade-off in accuracy. The model is designed to be helpful, safe, and flexible, with a focus on responsible deployment and mitigating potential risks such as bias, toxicity, and misinformation. It achieves state-of-the-art performance on various benchmarks, including conversational tasks, language translation, and text generation.

Llama-3.3-70B-Instruct-Turbo

Llama 3.3-70B is a multilingual LLM trained on a massive dataset of 15 trillion tokens, fine-tuned for instruction-following and conversational dialogue. The model is designed to be helpful, safe, and flexible, with a focus on responsible deployment and mitigating potential risks such as bias, toxicity, and misinformation. It achieves state-of-the-art performance on various benchmarks, including conversational tasks, language translation, and text generation.

Llama-3.3-70B-Instruct

Phi-4 is a model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.

phi-4

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a fine-tuned version of a pruned Whisper large-v3. In other words, it's the exact same model, except that the number of decoding layers has been reduced from 32 to 4. As a result, the model is much faster, at the expense of a minor quality degradation.

whisper-large-v3-turbo

You can POST to our OpenAI Chat Completions compatible endpoint.

#### Simple messages and prompts

Given a list of messages from a conversation, the model will return a response.

```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -d '{
 "model": "meta-llama/Llama-Guard-4-12B",
 "messages": [
 {
 "role": "user",
 "content": "Hello!"
 }
 ]
 }'
```

To which you'd get something like:

```json
{
 "id": "chatcmpl-guMTxWgpFf",
 "object": "chat.completion",
 "created": 1694623155,
 "model": "meta-llama/Llama-Guard-4-12B",
 "choices": [
 {
 "index": 0,
 "message": {
 "role": "assistant",
 "content": " Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
 },
 "finish_reason": "stop"
 }
 ],
 "usage": {
 "prompt_tokens": 15,
 "completion_tokens": 16,
 "total_tokens": 31,
 "estimated_cost": 0.0000268
 }
}
```

#### Conversations

To create a longer chat-like conversation you just have to add each response message and each of the user messages to every request. This way the model will have the context and will be able to provide better answers. You can tweak it even further by providing a system message.

```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -d '{
 "model": "meta-llama/Llama-Guard-4-12B",
 "messages": [
 {
 "role": "system",
 "content": "Respond like a michelin starred chef."
 },
 {
 "role": "user",
 "content": "Can you name at least two different techniques to cook lamb?"
 },
 {
 "role": "assistant",
 "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'"'"'m more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\""
 },
 {
 "role": "user",
 "content": "Tell me more about the second method."
 }
 ]
 }'
```

The conversation above might return something like the following:

```json
{
 "id": "chatcmpl-b23a3fb60cde42ce8f24bb980b4dee87",
 "object": "chat.completion",
 "created": 1715688169,
 "model": "meta-llama/Llama-Guard-4-12B",
 "choices": [
 {
 "index": 0,
 "message": {
 "role": "assistant",
 "content": "Sous le Sable, my friend! It's an ancient technique that's been used for centuries in the Middle East and North Africa. The name itself..."
 },
 "finish_reason": "stop"
 }
 ],
 "usage": {
 "prompt_tokens": 149,
 "total_tokens": 487,
 "completion_tokens": 338,
 "estimated_cost": 0.00035493
 }
}
```

The longer the conversation gets, the more time it takes the model to generate the response. The number of messages that you can have in a conversation is limited by the context size of a model. Larger models also usually take more time to respond.
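As a rough illustration of the bookkeeping described above, the sketch below keeps the message history in Python and builds the JSON body to send with each request. The helper names `add_turn` and `build_body` and the `max_messages` cap are hypothetical and not part of the API; the cap is only a crude stand-in for real context-size accounting, which depends on the model's tokenizer.

```python
import json


def add_turn(history, role, content):
    """Return a new history list with one more message appended."""
    return history + [{"role": role, "content": content}]


def build_body(model, history, max_messages=None):
    """Build the JSON request body for the chat completions endpoint.

    If max_messages is set (a hypothetical limit, not an API parameter),
    keep a leading system message plus only the most recent messages.
    """
    messages = list(history)
    if max_messages is not None and len(messages) > max_messages:
        # Preserve the system message, if the history starts with one.
        system = [m for m in messages[:1] if m["role"] == "system"]
        keep = max(max_messages - len(system), 0)
        messages = system + messages[len(messages) - keep:]
    return json.dumps({"model": model, "messages": messages})


history = add_turn([], "system", "Respond like a michelin starred chef.")
history = add_turn(history, "user", "Can you name at least two different techniques to cook lamb?")
body = build_body("meta-llama/Llama-Guard-4-12B", history)
```

After each response, the assistant message would be appended with `add_turn(history, "assistant", ...)` before the next user message, so that every request carries the full exchange.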

 

### Streaming

You can turn any of the requests above into a streaming request by passing `"stream": true`:

```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -d '{
 "model": "meta-llama/Llama-Guard-4-12B",
 "stream": true,
 "messages": [
 {
 "role": "user",
 "content": "Hello!"
 }
 ]
 }'
```

to which you'd get a sequence of [SSE](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events) events, finishing with `[DONE]`.

```
data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-Guard-4-12B", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " "}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-Guard-4-12B", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " Hi"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-Guard-4-12B", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "!"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-Guard-4-12B", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-Guard-4-12B", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "</s>"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-Guard-4-12B", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

data: [DONE]
```
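If you read the event stream without a client library, the chunks have to be stitched together by hand. The following is a minimal sketch, assuming the stream has already been split into text lines; the function name `collect_stream_text` is hypothetical. It relies only on the fields visible in the sample output above (`choices[0].delta.content`, `finish_reason`, and the `[DONE]` sentinel).

```python
import json


def collect_stream_text(lines):
    """Concatenate delta contents from chat.completion.chunk SSE lines.

    Returns a (text, finish_reason) tuple and stops at the [DONE] sentinel.
    """
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choice = json.loads(payload)["choices"][0]
        content = choice.get("delta", {}).get("content")
        if content:
            parts.append(content)
        if choice.get("finish_reason"):
            finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason
```

Note that this sketch passes every non-empty `content` value through unchanged, so markers such as `</s>` in the sample stream would appear in the result unless you filter them out.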

You can use the official OpenAI Python client to run inference with us.

```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
 api_key="$DEEPINFRA_TOKEN",
 base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
 model="meta-llama/Llama-Guard-4-12B",
 messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```

#### Conversations

To create a longer chat-like conversation you just have to add each response message and each of the user messages to every request. This way the model will have the context and will be able to provide better answers. You can tweak it even further by providing a system message.

```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
 api_key="$DEEPINFRA_TOKEN",
 base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
 model="meta-llama/Llama-Guard-4-12B",
 messages=[
 {"role": "system", "content": "Respond like a michelin starred chef."},
 {"role": "user", "content": "Can you name at least two different techniques to cook lamb?"},
 {"role": "assistant", "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'm more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\""},
 {"role": "user", "content": "Tell me more about the second method."},
 ],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

# Sous le Sable! It's an ancient technique that never goes out of style, n'est-ce pas? Literally ...
# 149 324
```

The longer the conversation gets, the more time it takes the model to generate the response. The number of messages that you can have in a conversation is limited by the context size of a model. Larger models also usually take more time to respond.

 

### Streaming

Streaming any of the chat completions above is supported by adding the `stream=True` option.


```python
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
 api_key="$DEEPINFRA_TOKEN",
 base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
 model="meta-llama/Llama-Guard-4-12B",
 messages=[{"role": "user", "content": "Hello"}],
 stream=True,
)

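for event in chat_completion:
    if event.choices[0].finish_reason:
        # In openai>=1.0 usage is an object, not a dict, and may be None on some chunks
        usage = event.usage
        print(
            event.choices[0].finish_reason,
            usage.prompt_tokens if usage else None,
            usage.completion_tokens if usage else None,
        )
    else:
        print(event.choices[0].delta.content)

# Hello
# !
# It
# 's
# nice
# ...
# stop 11 25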
```

You can use JavaScript in the browser or in Node.js to make requests.

```bash
npm install openai
```

then

```javascript
import OpenAI from "openai";

const openai = new OpenAI({
 baseURL: 'https://api.deepinfra.com/v1/openai',
 apiKey: "$DEEPINFRA_TOKEN",
});

async function main() {
 const completion = await openai.chat.completions.create({
 messages: [{ role: "user", content: "Hello" }],
 model: "meta-llama/Llama-Guard-4-12B",
 });

 console.log(completion.choices[0].message.content);
 console.log(completion.usage.prompt_tokens, completion.usage.completion_tokens);
}

main();

// Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
// 11 25
```

#### Conversations

To create a longer chat-like conversation you just have to add each response message and each of the user messages to every request. This way the model will have the context and will be able to provide better answers. You can tweak it even further by providing a system message.

```javascript
import OpenAI from "openai";

const openai = new OpenAI({
 baseURL: 'https://api.deepinfra.com/v1/openai',
 apiKey: "$DEEPINFRA_TOKEN",
});

async function main() {
 const completion = await openai.chat.completions.create({
 messages: [
 {role: "system", content: "Respond like a michelin starred chef."},
 {role: "user", content: "Can you name at least two different techniques to cook lamb?"},
 {role: "assistant", content: "Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'm more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\""},
 {role: "user", "content": "Tell me more about the second method."}
 ],
 model: "meta-llama/Llama-Guard-4-12B",
 });

 console.log(completion.choices[0].message.content);
 console.log(completion.usage.prompt_tokens, completion.usage.completion_tokens);
}

main();

// Sous le Sable, my friend! This traditional technique hails from the ancient Mediterranean, wher...
// 149 324
```

The longer the conversation gets, the more time it takes the model to generate the response. The number of messages that you can have in a conversation is limited by the context size of a model. Larger models also usually take more time to respond.

 

### Streaming

Streaming any of the chat completions above is supported by adding the `stream: true` option.

```javascript
import OpenAI from "openai";

const openai = new OpenAI({
 baseURL: 'https://api.deepinfra.com/v1/openai',
 apiKey: "$DEEPINFRA_TOKEN",
});

async function main() {
  const completion = await openai.chat.completions.create({
    messages: [{ role: "user", content: "Hello" }],
    model: "meta-llama/Llama-Guard-4-12B",
    stream: true,
  });

  for await (const chunk of completion) {
    if (chunk.choices[0].finish_reason) {
      console.log(chunk.choices[0].finish_reason, chunk.usage.prompt_tokens, chunk.usage.completion_tokens);
    } else {
      console.log(chunk.choices[0].delta.content);
    }
  }
}

main();

// Hello
// !
// It
// 's
// nice
// ...
// 11 25
```

The [AI SDK](https://sdk.vercel.ai/) by Vercel makes it very easy to do inference.

```bash
npm install ai @ai-sdk/deepinfra
```

then

```javascript
import { createDeepInfra } from "@ai-sdk/deepinfra";
import { generateText } from "ai";

const deepinfra = createDeepInfra({
 apiKey: "$DEEPINFRA_TOKEN",
});

const { text, usage, finishReason } = await generateText({
 model: deepinfra("meta-llama/Llama-Guard-4-12B"),
 prompt: "Write a vegetarian lasagna recipe for 4 people.",
});

console.log(text);
console.log(usage);
console.log(finishReason);
```

#### Conversations

To create a longer chat-like conversation you just have to add each response message and each of the user messages to every request. This way the model will have the context and will be able to provide better answers. You can tweak it even further by providing a system message.

```javascript
import { createDeepInfra } from "@ai-sdk/deepinfra";
import { generateText } from "ai";

const deepinfra = createDeepInfra({
 apiKey: "$DEEPINFRA_TOKEN",
});

const { text, usage, finishReason } = await generateText({
 model: deepinfra("meta-llama/Llama-Guard-4-12B"),
 messages: [
 { role: "system", content: "Respond like a michelin starred chef." },
 {
 role: "user",
 content: "Can you name at least two different techniques to cook lamb?",
 },
 {
 role: "assistant",
 content:
 'Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I\'m more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic "Sous Vide" method. Next, we have the ancient art of "Sous le Sable". And finally, we have the more modern technique of "Hot Smoking."',
 },
 { role: "user", content: "Tell me more about the second method." },
 ],
});

console.log(text);
console.log(usage);
console.log(finishReason);
```

The longer the conversation gets, the more time it takes the model to generate the response. The number of messages that you can have in a conversation is limited by the context size of a model. Larger models also usually take more time to respond.

 

### Streaming

Streaming text responses is easy: replace `generateText` with `streamText` and read the response chunk by chunk.

```javascript
import { createDeepInfra } from "@ai-sdk/deepinfra";
import { streamText } from "ai";

const deepinfra = createDeepInfra({
 apiKey: "$DEEPINFRA_TOKEN",
});

const result = streamText({
 model: deepinfra("meta-llama/Llama-3.3-70B-Instruct-Turbo"),
 prompt: "Invent a new holiday and describe its traditions.",
 system:
 "You are a professional writer. You write simple, clear, and concise content.",
});

for await (const textPart of result.textStream) {
 console.log(textPart);
}

console.log(await result.usage);
console.log(await result.finishReason);
```

It works for conversations, too.

This is an advanced and more complex API. We strongly recommend that you use OpenAI Chat Completions instead.

#### Simple prompt

To query this model you need to provide a properly formatted input string.

```bash
curl "https://api.deepinfra.com/v1/inference/meta-llama/Llama-Guard-4-12B" \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -d '{
 "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
 "stop": []
 }'
```

That will respond with

```json
{
 "request_id": "RWZDRhS5kdoM1XWwXLEshynO",
 "inference_status": {
 "status": "succeeded",
 "runtime_ms": 243,
 "cost": 0.0000436,
 "tokens_input": 12,
 "tokens_generated": 25
 },
 "results": [
 {
 "generated_text": "Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
 }
 ],
 "num_tokens": 25,
 "num_input_tokens":12
}
```

#### Conversations

The OpenAI Chat Completions API is better suited for chat-like conversations; use it instead.

However, you can still do it with this API if you really need to. You have to add each response and each of the user prompts to every request, in a properly formatted input string, so that the model understands the current context. See the example below.
You can tweak it even further by providing a system message.

```bash
curl "https://api.deepinfra.com/v1/inference/meta-llama/Llama-Guard-4-12B" \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -d '{
 "input": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nRespond like a michelin starred chef.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCan you name at least two different techniques to cook lamb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBonjour! Let me tell you, my friend, cooking lamb is an art form, and I'"'"'m more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\"<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTell me more about the second method.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
 "stop": []
 }'
```

The conversation above might return something like the following:

```json
{
 "request_id": "RWZDRhS5kdoM1XWwXLEshynO",
 "inference_status": {
 "status": "succeeded",
 "runtime_ms": 243,
 "cost": 0.000436,
 "tokens_input": 149,
 "tokens_generated": 338
 },
 "results": [
 {
 "generated_text": "Sous le Sable, my friend! It's an ancient technique that's been used for centuries in the Middle East and North Africa. The name itself..."
 }
 ],
 "num_tokens": 338,
 "num_input_tokens": 149
}
```

The longer the conversation gets, the more time it takes the model to generate the response. The conversation is limited by the context size of a model. Larger models also usually take more time to respond.

 

### Streaming


To do a streaming request, just pass `"stream": true`:

```bash
curl "https://api.deepinfra.com/v1/inference/meta-llama/Llama-Guard-4-12B" \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
 -d '{
 "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
 "stop": [],
 "stream": true
 }'
```

which outputs:

```text
data: {"token": {"id": null, "text": "Hello", "logprob": 0.0, "special": false}, "generated_text": "", "details": null, "estimated_cost": null}

data: {"token": {"id": null, "text": "!", "logprob": 0.0, "special": false}, "generated_text": "", "details": null, "estimated_cost": null}

data: {"token": {"id": null, "text": " It", "logprob": 0.0, "special": false}, "generated_text": "", "details": null, "estimated_cost": null}

data: {"token": {"id": null, "text": "'s", "logprob": 0.0, "special": false}, "generated_text": "", "details": null, "estimated_cost": null}

....

data: {"token": {"id": null, "text": "", "logprob": 0.0, "special": false}, "generated_text": null, "details": {"finish_reason": "stop"}, "num_output_tokens": 25, "num_input_tokens": 12, "estimated_cost": 0.0000386}
```
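The token events above can be reassembled into the full text with a few lines of Python. This is a minimal sketch, assuming the stream has been split into text lines; the function name `collect_native_stream` is hypothetical, and only fields shown in the sample output are read. The `[DONE]` check is a defensive assumption, since the sample for this endpoint does not show that sentinel.

```python
import json


def collect_native_stream(lines):
    """Join token texts from the inference stream and capture the final stats.

    Returns (text, summary), where summary is filled from the last event
    that carries a non-null "details" field.
    """
    parts = []
    summary = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumption: not shown in the sample output
            break
        event = json.loads(payload)
        token = event.get("token") or {}
        parts.append(token.get("text", ""))
        details = event.get("details")
        if details:
            summary = {
                "finish_reason": details.get("finish_reason"),
                "num_input_tokens": event.get("num_input_tokens"),
                "num_output_tokens": event.get("num_output_tokens"),
                "estimated_cost": event.get("estimated_cost"),
            }
    return "".join(parts), summary
```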

#### Input format

You can see below the basic format of the input. Bear in mind that newlines often matter.

```text
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

Conversation prompts contain the history of the exchanged prompts and responses.

```text
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

First question<|eot_id|><|start_header_id|>assistant<|end_header_id|>

First answer<|eot_id|><|start_header_id|>user<|end_header_id|>

Second question<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Second answer<|eot_id|><|start_header_id|>user<|end_header_id|>

Final question<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

To add a system prompt, place it before the first user message:

```text
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

System prompt<|eot_id|><|start_header_id|>user<|end_header_id|>

First question<|eot_id|><|start_header_id|>assistant<|end_header_id|>

First answer<|eot_id|><|start_header_id|>user<|end_header_id|>

Second question<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Second answer<|eot_id|><|start_header_id|>user<|end_header_id|>

Final question<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
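The templates above can be assembled programmatically. The following is a minimal sketch, assuming messages are plain objects with `role` and `content` fields; `buildPrompt` is a hypothetical helper, not part of the DeepInfra client. It ends with an open assistant header followed by two newlines, matching the `input` strings used in the request examples.

```javascript
// Build a raw input string from a list of { role, content } messages.
// Assumption: the message shape and the helper name are illustrative only.
function buildPrompt(messages) {
  let prompt = "<|begin_of_text|>";
  for (const { role, content } of messages) {
    // Each turn: header with the role, a blank line, the content, then <|eot_id|>.
    prompt += `<|start_header_id|>${role}<|end_header_id|>\n\n${content}<|eot_id|>`;
  }
  // Open an assistant turn so the model writes the next reply.
  prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n";
  return prompt;
}

console.log(
  buildPrompt([
    { role: "system", content: "System prompt" },
    { role: "user", content: "First question" },
  ])
);
```

Because newlines matter in this format, building the string in one place tends to be less error-prone than concatenating it by hand in each request.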

You can also run inference with our command-line tool [deepctl](/docs/getting-started).

This is an advanced and more complex API. We strongly recommend that you use OpenAI Chat Completions instead.

#### Simple prompt

To query this model you need to provide a properly formatted input string.

```bash
deepctl infer \
 -m 'meta-llama/Llama-Guard-4-12B' \
 -i 'input="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"' \
 -i 'stop=[]'
```

That will respond with:

```json
{
 "inference_status": {
 "status": "succeeded",
 "runtime_ms": 243,
 "cost": 0.0000436,
 "tokens_input": 12,
 "tokens_generated": 25
 },
 "results": [
 {
 "generated_text": "Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
 }
 ],
 "num_tokens": 25,
 "num_input_tokens": 12
}
```

#### Conversations

The OpenAI Chat Completions API is better suited for chat-like conversations; use it instead.

If you really need this API for a conversation, include every earlier user prompt and model response in each request, formatted as a single input string so that the model understands the current context. The example below shows one such string. You can steer the model further by adding a system message.

```bash
deepctl infer \
 -m 'meta-llama/Llama-Guard-4-12B' \
 -i 'input="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nRespond like a michelin starred chef.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCan you name at least two different techniques to cook lamb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBonjour! Let me tell you, my friend, cooking lamb is an art form, and I'"'"'m more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\"<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTell me more about the second method.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"' \
 -i 'stop=[]'
```

The conversation above might return something like the following:

```json
{
 "inference_status": {
 "status": "succeeded",
 "runtime_ms": 243,
 "cost": 0.000436,
 "tokens_input": 149,
 "tokens_generated": 338
 },
 "results": [
 {
 "generated_text": "Sous le Sable, my friend! It's an ancient technique that's been used for centuries in the Middle East and North Africa. The name itself..."
 }
 ],
 "num_tokens": 338,
 "num_input_tokens": 149
}
```

The longer the conversation gets, the more time it takes the model to generate the response. The conversation is limited by the context size of a model. Larger models also usually take more time to respond.


We recommend using our Node.js client, [deepinfra-node](https://github.com/deepinfra/deepinfra-node).

Install it with:

```bash
npm install deepinfra
```

#### Simple prompt

To query this model you need to provide a properly formatted input string.

```javascript
import { TextGeneration } from "deepinfra";

const DEEPINFRA_API_KEY = process.env.DEEPINFRA_TOKEN; // read the API token from the environment
const MODEL_URL = 'https://api.deepinfra.com/v1/inference/meta-llama/Llama-Guard-4-12B';

async function main() {
 const client = new TextGeneration(MODEL_URL, DEEPINFRA_API_KEY);
 const res = await client.generate({
 "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
 "stop": []
 });
 console.log(res.results[0].generated_text);
}

main();

// Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
```

#### Conversations

The OpenAI Chat Completions API is better suited for chat-like conversations; use it instead.

If you really need this API for a conversation, include every earlier user prompt and model response in each request, formatted as a single input string so that the model understands the current context. The example below shows one such string. You can steer the model further by adding a system message.

```javascript
import { TextGeneration } from "deepinfra";

const DEEPINFRA_API_KEY = process.env.DEEPINFRA_TOKEN; // read the API token from the environment
const MODEL_URL = 'https://api.deepinfra.com/v1/inference/meta-llama/Llama-Guard-4-12B';

async function main() {
 const client = new TextGeneration(MODEL_URL, DEEPINFRA_API_KEY);
 const res = await client.generate({
 "input": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nRespond like a michelin starred chef.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCan you name at least two different techniques to cook lamb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBonjour! Let me tell you, my friend, cooking lamb is an art form, and I'm more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\"<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTell me more about the second method.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
 "stop": []
 });
 console.log(res.results[0].generated_text);
}

main();

// Sous le Sable! It's an ancient technique that never goes out of style, n'est-ce pas? Literally ...
```

The longer the conversation gets, the more time it takes the model to generate the response.
The number of messages that you can have in a conversation is limited by the context size of a model.
Larger models also usually take more time to respond.
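One way to keep a long conversation within the context limit is to drop the oldest turns before building the input string. The sketch below is only an illustration: it measures size in characters as a crude stand-in for tokens (an assumption, since real limits are counted in tokens), and `trimHistory` and the `{ role, content }` message shape are hypothetical.

```javascript
// Drop the oldest turns until the history fits a rough size budget.
// Assumption: character count is used as a crude proxy for token count.
function trimHistory(messages, maxChars) {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");
  const size = (list) => list.reduce((n, m) => n + m.content.length, 0);

  // Always keep the system message(s) and the most recent message.
  while (turns.length > 1 && size(system) + size(turns) > maxChars) {
    turns.shift();
    // Do not leave an orphaned assistant reply at the start of the history.
    while (turns.length > 1 && turns[0].role === "assistant") {
      turns.shift();
    }
  }
  return [...system, ...turns];
}

const history = [
  { role: "system", content: "Respond like a michelin starred chef." },
  { role: "user", content: "Can you name at least two different techniques to cook lamb?" },
  { role: "assistant", content: "Bonjour! First, Sous Vide. Next, Sous le Sable. Finally, Hot Smoking." },
  { role: "user", content: "Tell me more about the second method." },
];
console.log(trimHistory(history, 100).map((m) => m.role)); // [ 'system', 'user' ]
```

Trimming this way loses earlier context, so the model may no longer know what "the second method" refers to; summarizing older turns is a common alternative.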


Input fields:

- `input`: the properly formatted input string.
- `max_new_tokens`: maximum length of the newly generated text. If explicitly set to None, it will be the model's max context length minus input length or 16384, whichever is smaller.
- `temperature`: temperature to use for sampling. 0 means the output is deterministic. Values greater than 1 encourage more diversity.
- `top_p`: sample from the set of tokens with the highest probability such that the sum of probabilities is higher than p. Lower values focus on the most probable tokens. Higher values sample more low-probability tokens.
- `min_p`: float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.
- `top_k`: sample from the best k tokens. 0 means off.
- `repetition_penalty`: repetition penalty. A value of 1 means no penalty, values greater than 1 discourage repetition, and values smaller than 1 encourage repetition.
- `stop`: up to 16 strings that will terminate generation immediately.
- `num_responses`: number of output sequences to return. Incompatible with streaming.
- `response_format`: optional nested object with "type" set to "json_object".
- `presence_penalty`: positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
- `frequency_penalty`: positive values penalize new tokens based on how many times they appear in the text so far, decreasing the model's likelihood to repeat the same text verbatim.
- `user`: a unique identifier representing your end-user, which can help monitor and detect abuse. Avoid sending us any identifying information. We recommend hashing user identifiers.
- `seed`: seed for the random number generator. If not provided, a random seed is used. Determinism is not guaranteed.
- `webhook`: the webhook to call when inference is done. By default, you will get the output in the response of your inference request.
- `stream`: whether to stream tokens. Defaults to false. Currently only supported for Llama 2 text generation models. Token-by-token updates are sent over SSE.
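Some of the constraints stated for these fields can be checked on the client before sending a request. The sketch below is an assumption-laden illustration rather than the server's actual validation: `checkParams` is a hypothetical helper, and treating `num_responses` greater than 1 as the case that conflicts with `stream` is an interpretation of "incompatible with streaming".

```javascript
// Check a few documented constraints on request parameters.
// Assumption: this mirrors the field descriptions, not the server's real validation.
function checkParams(params) {
  const problems = [];
  const { stop, min_p, num_responses, stream } = params;

  if (stop !== undefined && (!Array.isArray(stop) || stop.length > 16)) {
    problems.push("stop must be an array of at most 16 strings");
  }
  if (min_p !== undefined && (min_p < 0 || min_p > 1)) {
    problems.push("min_p must be in [0, 1]");
  }
  // Interpretation: more than one response cannot be streamed.
  if (stream === true && num_responses !== undefined && num_responses > 1) {
    problems.push("num_responses > 1 is incompatible with streaming");
  }
  return problems;
}

console.log(checkParams({ stop: [], min_p: 0.05, stream: true })); // []
console.log(checkParams({ min_p: 1.5, stream: true, num_responses: 2 }).length); // 2
```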

Schema objects:

- `JsonSchema`: Name, Schema.
- `ResponseFormat`: Type, plus a JSON schema for structured output when type is 'json_schema'.
- `TextGenerationIn`: Frequency Penalty, Input, Max New Tokens, Min P, Num Responses, Presence Penalty, Repetition Penalty, Seed, Stop, Stream, Temperature, Top K, Top P, User, Webhook.
- `GeneratedText`: `generated_text`.
- `InferenceReplyStatus`: `cost` (estimated cost billed for the request in USD), `runtime_ms`, `status`, `tokens_generated`, `tokens_input`.
- `TextGenerationOut`: `inference_status` (object containing the status of the inference request), `num_input_tokens`, `num_tokens` (number of generated tokens, excluding prompt), `request_id`, `results`.

Chat completion fields:

- `model`
- `messages`: conversation messages: (user,assistant,tool)*,user including one system message anywhere.
- `stream`: whether to stream the output via SSE or return the full response.
- `temperature`: what sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
- `top_p`: an alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
- `max_tokens`: the maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length. If explicitly set to None, it will be the model's max context length minus input length or 16384, whichever is smaller.
- `stop`: up to 16 sequences where the API will stop generating further tokens.
- `tools`: a list of tools the model may call. Currently, only functions are supported as a tool.
- `tool_choice`: controls which (if any) function is called by the model. `none` means the model will not call a function and instead generates a message. `auto` means the model can pick between generating a message or calling a function. Specifying a particular function choice is not currently supported. `none` is the default when no functions are present; `auto` is the default if functions are present.
- `response_format`: the format of the response. Currently, only json is supported.
- `repetition_penalty`: alternative penalty for repetition, but multiplicative instead of additive (> 1 penalize, < 1 encourage).
- `logprobs`: whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token returned in the `content` of `message`.
- `stream_options`

Constrains effort on reasoning for reasoning models. Currently supported values are none, low, medium, and high. Reducing reasoning effort can result in faster responses and fewer tokens used on reasoning in a response. Setting it to none disables reasoning entirely if the model supports it.
