
Qwen/Qwen2.5-Coder-32B-Instruct

Qwen2.5-Coder is the latest series of code-specific Qwen large language models (formerly known as CodeQwen). It brings significant improvements in code generation, code reasoning, and code fixing, and provides a more comprehensive foundation for real-world applications such as code agents. Beyond enhancing coding capabilities, it maintains its strengths in mathematics and general competencies.

Visibility: Public

Pricing: $0.06 / $0.15 per million input / output tokens

Quantization: fp8

Context length: 32,768 tokens

License: Project License
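As a rough illustration of the listed rates, the sketch below estimates the cost of a request from the token counts reported in a response's `usage` field. It assumes the prices above ($0.06 per million input tokens, $0.15 per million output tokens) are current and that no other charges apply; the `estimated_cost` value returned by the API should be treated as authoritative. `estimate_cost` is an illustrative helper, not part of any SDK.

```python
# Prices listed on this page, in USD per million tokens.
# Assumption: these are current and there are no other charges.
INPUT_PRICE_PER_M = 0.06
OUTPUT_PRICE_PER_M = 0.15


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate request cost in USD from token counts (a sketch, not billing logic)."""
    return (prompt_tokens * INPUT_PRICE_PER_M + completion_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# 1,000 input tokens and 500 output tokens
print(f"${estimate_cost(1_000, 500):.6f}")  # $0.000135
```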


OpenAI-compatible HTTP API

You can POST to our OpenAI Chat Completions compatible endpoint.

Simple messages and prompts

Given a list of messages from a conversation, the model will return a response.

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'
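For callers not using curl, the same request can be assembled with Python's standard library. This is a minimal sketch: it assumes the API token is supplied by the caller (the curl example reads it from `DEEPINFRA_TOKEN`), and the helper names `build_chat_request` and `send` are illustrative, not part of any DeepInfra SDK. Building and sending are kept separate so the request can be inspected without a network call.

```python
import json
import urllib.request

API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL = "Qwen/Qwen2.5-Coder-32B-Instruct"


def build_chat_request(token: str, messages: list, **params) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl example.

    Extra keyword arguments (e.g. stream=True, temperature=0.2) are copied into
    the JSON body as-is. Hypothetical helper, not an official client.
    """
    body = {"model": MODEL, "messages": messages, **params}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a non-streaming request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage: `send(build_chat_request(os.environ["DEEPINFRA_TOKEN"], [{"role": "user", "content": "Hello!"}]))` returns the decoded response as a dictionary.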

The response looks something like this:

{
    "id": "chatcmpl-guMTxWgpFf",
    "object": "chat.completion",
    "created": 1694623155,
    "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 15,
        "completion_tokens": 16,
        "total_tokens": 31,
        "estimated_cost": 0.0000268
    }
}
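Most callers only need the assistant's reply, the `finish_reason`, and the `usage` counts. A small sketch for pulling them out of a decoded response, assuming it has the shape shown above; `summarize_completion` is a hypothetical helper name.

```python
def summarize_completion(response: dict) -> dict:
    """Extract reply text, finish reason, and token usage from a chat completion (hypothetical helper)."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        # Sample replies may start with a space, so surrounding whitespace is stripped.
        "content": choice["message"]["content"].strip(),
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
    }


example = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": " Hello!"}, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 15, "completion_tokens": 16, "total_tokens": 31},
}
print(summarize_completion(example))
```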

Conversations

To create a longer, chat-like conversation, include every previous user message and every assistant response in each request. This gives the model the context it needs to provide better answers. You can steer it further by providing a system message.

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
      "messages": [
        {
            "role": "system",
            "content": "Respond like a michelin starred chef."
        },
        {
          "role": "user",
          "content": "Can you name at least two different techniques to cook lamb?"
        },
        {
          "role": "assistant",
          "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'"'"'m more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\""
        },
        {
          "role": "user",
          "content": "Tell me more about the second method."
        }
      ]
    }'
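The bookkeeping described above amounts to appending every turn to one list and resending the whole list each time. A minimal sketch of that pattern follows; the `Conversation` class is hypothetical, and the HTTP call is left to the caller so the history logic can be shown on its own.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Conversation:
    """Holds the full message history that must accompany every request (hypothetical helper)."""

    system: Optional[str] = None
    messages: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # The optional system message goes first so it frames the whole exchange.
        if self.system is not None:
            self.messages.insert(0, {"role": "system", "content": self.system})

    def add_user(self, text: str) -> list:
        """Record a user turn and return a copy of the messages to send next."""
        self.messages.append({"role": "user", "content": text})
        return list(self.messages)

    def add_assistant(self, text: str) -> None:
        """Record the model's reply so later requests include it as context."""
        self.messages.append({"role": "assistant", "content": text})


chat = Conversation(system="Respond like a michelin starred chef.")
chat.add_user("Can you name at least two different techniques to cook lamb?")
chat.add_assistant("Bonjour! First, Sous Vide; next, Sous le Sable; finally, Hot Smoking.")
payload = chat.add_user("Tell me more about the second method.")
print([m["role"] for m in payload])  # ['system', 'user', 'assistant', 'user']
```

The returned `payload` is what goes into the `messages` field of the next request; after the reply arrives, pass its content to `add_assistant` before the next user turn.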

The conversation above might return something like the following:

{
    "id": "chatcmpl-b23a3fb60cde42ce8f24bb980b4dee87",
    "object": "chat.completion",
    "created": 1715688169,
    "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Sous le Sable, my friend! It's an ancient technique that's been used for centuries in the Middle East and North Africa. The name itself..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 149,
        "total_tokens": 487,
        "completion_tokens": 338,
        "estimated_cost": 0.00035493
    }
}

The longer the conversation gets, the longer the model takes to generate a response. The number of messages you can include is limited by the model's context size (32,768 tokens for this model), which must hold both the conversation and the generated reply. Larger models also usually take more time to respond.
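Because the whole history is resent each time, a long conversation eventually has to be shortened to fit the context window. One common approach, sketched below, drops the oldest non-system messages until an estimated token count fits a budget. The 4-characters-per-token estimate is a rough assumption, not this model's tokenizer; use a real tokenizer for accurate counts. `estimate_tokens` and `trim_history` are illustrative names.

```python
CONTEXT_LENGTH = 32_768  # context size listed for this model


def estimate_tokens(message: dict) -> int:
    """Rough token estimate (assumption): ~4 characters per token plus a small per-message overhead."""
    return len(message["content"]) // 4 + 4


def trim_history(messages: list, reserve_for_output: int = 1024, limit: int = CONTEXT_LENGTH) -> list:
    """Drop the oldest non-system messages until the estimate fits in limit - reserve_for_output."""
    budget = limit - reserve_for_output
    trimmed = list(messages)
    while sum(estimate_tokens(m) for m in trimmed) > budget:
        # Oldest message that is not the system prompt.
        idx = next((i for i, m in enumerate(trimmed) if m["role"] != "system"), None)
        if idx is None or idx == len(trimmed) - 1:
            break  # always keep the system prompt and the latest message
        trimmed.pop(idx)
    return trimmed
```

Reserving room for the output matters because the context must also hold the generated tokens. A fuller version might drop user/assistant pairs together so the history never starts with an assistant turn.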

Streaming

You can turn any of the requests above into a streaming request by passing "stream": true:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
      "stream": true,
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'

The response arrives as a sequence of server-sent events (SSE), ending with data: [DONE]:

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "Qwen/Qwen2.5-Coder-32B-Instruct", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " "}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "Qwen/Qwen2.5-Coder-32B-Instruct", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " Hi"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "Qwen/Qwen2.5-Coder-32B-Instruct", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "!"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "Qwen/Qwen2.5-Coder-32B-Instruct", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "Qwen/Qwen2.5-Coder-32B-Instruct", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "</s>"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "Qwen/Qwen2.5-Coder-32B-Instruct", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

data: [DONE]
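Each event is a `data:` line carrying a JSON chunk, and the full reply is reassembled by concatenating the `delta.content` fragments until the `[DONE]` sentinel. A minimal parser over already-received lines, assuming one JSON event per `data:` line as in the sample above; `collect_stream` is an illustrative name.

```python
import json


def collect_stream(lines):
    """Concatenate delta.content from SSE 'data:' lines; return (text, finish_reason).

    Assumes one JSON chunk per 'data:' line, as in the sample stream.
    """
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason
```

With a live streaming response from `urllib.request.urlopen`, each line arrives as bytes and must be decoded first, e.g. `collect_stream(line.decode("utf-8") for line in resp)`.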

Input fields

`model` (string)

The model name, e.g. `Qwen/Qwen2.5-Coder-32B-Instruct`.

`messages` (array)

Conversation messages: any number of user, assistant, and tool messages, ending with a user message, optionally including one system message anywhere in the list.
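One reading of the ordering rule above, expressed as a client-side check: at most one `system` message, every role drawn from `system`, `user`, `assistant`, and `tool`, and the last non-system message coming from the user. This is an interpretation of the one-line description, not the server's actual validation logic, and `check_messages` is a hypothetical helper.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def check_messages(messages: list) -> list:
    """Return problems found in a messages array; an empty list means it looks valid.

    One interpretation of the documented rule, not the server's validator.
    """
    if not messages:
        return ["messages must not be empty"]
    problems = []
    roles = [m.get("role") for m in messages]
    unknown = sorted({r for r in roles if r not in ALLOWED_ROLES}, key=str)
    if unknown:
        problems.append(f"unknown roles: {unknown}")
    if roles.count("system") > 1:
        problems.append("at most one system message is allowed")
    non_system = [r for r in roles if r != "system"]
    if not non_system or non_system[-1] != "user":
        problems.append("the last non-system message should come from the user")
    return problems
```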

`stream` (boolean)

Whether to stream the output via SSE or return the full response.

Default value: false

`temperature` (number)

What sampling temperature to use, between 0 and 2. Higher values such as 0.8 make the output more random, while lower values such as 0.2 make it more focused and deterministic.

Default value: 1

Range: 0 ≤ temperature ≤ 2

`top_p` (number)

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

Default value: 1

Range: 0 < top_p ≤ 1

`min_p` (number)

Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

Default value: 0

Range: 0 ≤ min_p ≤ 1

`top_k` (integer)

Sample only from the k most likely tokens. 0 disables top-k filtering.

Default value: 0

Range: 0 ≤ top_k < 1000
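To make `temperature`, `top_p`, `min_p`, and `top_k` concrete, the sketch below applies them to a toy distribution over four tokens, following the definitions above. It is a conceptual illustration, not the server's implementation: here each filter is computed on the temperature-scaled distribution and the surviving tokens are intersected, whereas a real serving stack may apply the filters sequentially or in a different order.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0):
    """Toy illustration: token probabilities after temperature, top-k, top-p and min-p filtering."""
    if temperature <= 0:
        raise ValueError("this sketch requires temperature > 0")

    # Temperature: scale logits, then softmax (subtracting the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)

    if top_k > 0:  # top_k = 0 means off
        keep &= set(order[:top_k])

    if top_p < 1.0:  # smallest set of top tokens whose probability mass reaches top_p
        cumulative, nucleus = 0.0, set()
        for i in order:
            nucleus.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep &= nucleus

    if min_p > 0.0:  # threshold relative to the most likely token's probability
        threshold = min_p * probs[order[0]]
        keep &= {i for i in order if probs[i] >= threshold}

    # Renormalize over the surviving tokens (the most likely token always survives).
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


print(filter_distribution([2.0, 1.0, 0.0, -1.0], top_k=2))
```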

`max_tokens` (integer)

The maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length. If explicitly set to None (null in JSON), it defaults to the model's maximum context length minus the input length, or 16384, whichever is smaller.

Range: 0 ≤ max_tokens ≤ 1000000
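The default described above is a simple minimum. As a worked example, assuming the 32,768-token context length listed for this model: a 30,000-token prompt leaves room for 2,768 generated tokens, while a short prompt is capped at 16,384. `default_max_tokens` is an illustrative helper, not an API function.

```python
CONTEXT_LENGTH = 32_768  # context size listed for this model
DEFAULT_CAP = 16_384     # upper bound quoted in the max_tokens description


def default_max_tokens(input_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Generation budget when max_tokens is None: min(context - input, 16384).

    Clamped at 0 in this sketch for prompts that already fill the context.
    """
    return max(0, min(context_length - input_tokens, DEFAULT_CAP))


print(default_max_tokens(30_000))  # 2768
print(default_max_tokens(100))     # 16384
```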

`stop` (string)

Up to 16 sequences at which the API will stop generating further tokens.

`n` (integer)

Number of sequences to return.

Default value: 1

Range: 1 ≤ n ≤ 4

`presence_penalty` (number)

Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

Default value: 0

Range: -2 ≤ presence_penalty ≤ 2

`frequency_penalty` (number)

Positive values penalize new tokens based on how many times they have already appeared in the text so far, decreasing the model's likelihood to repeat the same line verbatim.

Default value: 0

Range: -2 ≤ frequency_penalty ≤ 2
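The two additive penalties differ only in how they use a token's count so far. The sketch below follows the adjustment described in OpenAI's API documentation, which these parameters mirror: each candidate logit is reduced by `frequency_penalty` times the token's count, plus `presence_penalty` once if the count is nonzero. Whether this endpoint's backend applies exactly this formula is an assumption, and tokens are plain strings here for readability.

```python
from collections import Counter


def apply_additive_penalties(logits: dict, generated: list,
                             presence_penalty: float = 0.0,
                             frequency_penalty: float = 0.0) -> dict:
    """Penalize already-generated tokens (formula assumed from OpenAI's documentation).

    logit -= count * frequency_penalty + (1 if count > 0 else 0) * presence_penalty
    """
    counts = Counter(generated)
    adjusted = {}
    for token, logit in logits.items():
        c = counts[token]
        adjusted[token] = logit - c * frequency_penalty - (1.0 if c > 0 else 0.0) * presence_penalty
    return adjusted


print(apply_additive_penalties({"a": 1.0, "b": 1.0, "c": 1.0}, ["a", "a", "a", "b"],
                               presence_penalty=0.5, frequency_penalty=0.25))
# {'a': -0.25, 'b': 0.25, 'c': 1.0}
```

The frequency term grows with every repetition, while the presence term is a flat, one-time charge for having appeared at all.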

`tools` (array)

A list of tools the model may call. Currently, only functions are supported as a tool.

`tool_choice` (string)

Controls which (if any) function is called by the model. `none` means the model will not call a function and instead generates a message. `auto` means the model can choose between generating a message and calling a function. Specifying a particular function is not currently supported. `none` is the default when no functions are present; `auto` is the default if functions are present.
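A request body offering the model one function might be assembled as below. The tool layout (`"type": "function"` with `name`, `description`, and a JSON Schema `parameters` object) is assumed from the OpenAI Chat Completions format this endpoint is compatible with, and the `get_weather` function is purely hypothetical. The helper rejects any `tool_choice` other than `auto` or `none`, since naming a particular function is not supported.

```python
import json

MODEL = "Qwen/Qwen2.5-Coder-32B-Instruct"


def tool_request_body(messages: list, tools: list, tool_choice: str = "auto") -> dict:
    """Assemble a chat request body that offers functions to the model (hypothetical helper)."""
    if tool_choice not in ("auto", "none"):
        # Naming a specific function is not supported, per the tool_choice description.
        raise ValueError("tool_choice must be 'auto' or 'none'")
    return {"model": MODEL, "messages": messages, "tools": tools, "tool_choice": tool_choice}


# Hypothetical function definition, in the OpenAI tools format (assumed).
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

body = tool_request_body([{"role": "user", "content": "Weather in Paris?"}], [get_weather])
print(json.dumps(body, indent=2))
```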

`response_format` (object)

The format of the response. Currently, only json is supported.

`repetition_penalty` (number)

An alternative repetition penalty that is multiplicative rather than additive: values > 1 penalize repetition, values < 1 encourage it.

Default value: 1

Range: 0.01 ≤ repetition_penalty ≤ 5
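A widely used convention for the multiplicative penalty is the one in Hugging Face Transformers' `RepetitionPenaltyLogitsProcessor`: for every token already seen, a positive logit is divided by the penalty and a negative logit is multiplied by it, so values above 1 always make the token less likely. Assuming this endpoint follows that convention (not confirmed here), it looks like this:

```python
def apply_repetition_penalty(logits: dict, seen: set, penalty: float = 1.0) -> dict:
    """Scale logits of previously seen tokens (Hugging Face-style convention, assumed).

    Positive logits are divided by the penalty, negative ones multiplied by it.
    """
    adjusted = dict(logits)
    for token in seen:
        if token in adjusted:
            logit = adjusted[token]
            adjusted[token] = logit / penalty if logit > 0 else logit * penalty
    return adjusted


print(apply_repetition_penalty({"a": 2.0, "b": -1.0, "c": 0.5}, {"a", "b"}, penalty=2.0))
# {'a': 1.0, 'b': -2.0, 'c': 0.5}
```

Handling the sign separately is what keeps the penalty one-directional: naively dividing a negative logit would move it toward zero and make the token more likely.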

`user` (string)

A unique identifier representing your end-user, which can help monitor and detect abuse. Avoid sending us any identifying information. We recommend hashing user identifiers.

`seed` (integer)

Seed for the random number generator. If not provided, a random seed is used. Determinism is not guaranteed.

Range: -9223372036854775808 ≤ seed < 18446744073709551616

`logprobs` (boolean)

Whether to return log probabilities of the output tokens. If true, returns the log probabilities of each output token returned in the `content` of `message`.

`stream_options` (object)

Streaming options.

`reasoning_effort` (string)

Constrains effort on reasoning for reasoning models. Currently supported values are none, low, medium, and high. Reducing reasoning effort can result in faster responses and fewer tokens spent on reasoning. Setting it to none disables reasoning entirely, if the model supports this.

Allowed values: low, medium, high, none


