
mistralai/Mixtral-8x22B-v0.1

Mixtral-8x22B is the latest and largest mixture-of-experts (MoE) large language model (LLM) from Mistral AI. It is a state-of-the-art model built from a mixture of eight 22B-parameter experts; during inference, two experts are selected per token. This architecture lets large models stay fast and cheap at inference time. This model is not instruction-tuned.

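To illustrate the routing idea only (a toy sketch, not Mistral's implementation), the snippet below shows top-2 gating over 8 experts: a router scores every expert for the input, the two best-scoring experts are evaluated, and their outputs are combined with the normalized router weights.

import numpy as np

def moe_layer(x, experts, router_w, k=2):
    # Toy top-k mixture-of-experts layer for one token vector x.
    # experts: list of callables (one per expert); router_w: (d, n_experts).
    scores = x @ router_w                             # one router logit per expert
    top = np.argsort(scores)[-k:]                     # indices of the k best experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                          # softmax over the selected experts only
    # Only the selected experts run, which is what keeps inference fast and cheap.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Tiny demo: 8 random linear "experts", 2 of which are used per call.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [(lambda x, W=rng.normal(size=(d, d)): x @ W) for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))
print(moe_layer(rng.normal(size=d), experts, router_w).shape)   # (16,)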

Visibility: Public
Precision: fp16
Context length: 65,536 tokens

OpenAI-compatible HTTP API

You can POST to our OpenAI Completions compatible endpoint.

However, this is an advanced and more complex API. We strongly recommend that you use OpenAI Chat Completions instead.

Simple prompt

To query this model you need to provide a properly formatted input string.

curl "https://api.deepinfra.com/v1/openai/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
     "model": "mistralai/Mixtral-8x22B-v0.1",
     "prompt": "<s>[INST] Hello! [/INST]",
     "stop": [
       "</s>"
     ]
   }'

To which you'd get something like

{
    "id": "cmpl-1b8401a68c5141eb825f68944dcea2c1",
    "object": "text_completion",
    "created": 1700578595,
    "model": "mistralai/Mixtral-8x22B-v0.1",
    "choices": [
        {
            "index": 0,
            "text": "Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?",
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 4,
        "total_tokens": 9,
        "completion_tokens": 5,
        "estimated_cost": 0.00035493
    }
}
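
Because the endpoint is OpenAI-compatible, you can make the same request from Python with the official openai client pointed at the DeepInfra base URL. This is a minimal sketch; it assumes your token is in the DEEPINFRA_TOKEN environment variable and that the SDK's completions API maps directly onto the endpoint above.

import os
from openai import OpenAI

# Point the standard OpenAI client at DeepInfra's OpenAI-compatible base URL.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

completion = client.completions.create(
    model="mistralai/Mixtral-8x22B-v0.1",
    prompt="<s>[INST] Hello! [/INST]",
    stop=["</s>"],
)
print(completion.choices[0].text)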

Conversations

The OpenAI Chat Completions API is better suited for chat-like conversations; use it instead.

However, you can still use this endpoint if you really need to. You have to include every previous user prompt and model response in each request, formatted as a single input string so the model understands the current context (see the example below). You can tweak it further by providing a system message.

curl "https://api.deepinfra.com/v1/openai/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
     "model": "mistralai/Mixtral-8x22B-v0.1",
     "prompt": "<s>[INST] <<SYS>>\nRespond like a michelin starred chef.\n<</SYS>>\n\nCan you name at least two different techniques to cook lamb? [/INST] Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'"'"'m more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\" </s><s>[INST] Tell me more about the second method. [/INST]",
     "stop": [
       "</s>"
     ]
   }'

The conversation above might return something like the following

{
    "id": "cmpl-b23a3fb60cde42ce8f24bb980b4dee87",
    "object": "text_completion",
    "created": 1715688169,
    "model": "mistralai/Mixtral-8x22B-v0.1",
    "choices": [
        {
            "index": 0,
            "text": "Sous le Sable, my friend! It's an ancient technique that's been used for centuries in the Middle East and North Africa. The name itself...",
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 149,
        "total_tokens": 487,
        "completion_tokens": 338,
        "estimated_cost": 0.00035493
    }
}
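
Programmatically, the history string simply grows with every turn. The sketch below (illustrative only; the helper name and messages are made up for the example) re-sends the whole history plus the new user message on each call, following the format described under Input format below.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

def complete(prompt):
    out = client.completions.create(
        model="mistralai/Mixtral-8x22B-v0.1",
        prompt=prompt,
        stop=["</s>"],
    )
    return out.choices[0].text

# First turn.
history = "<s>[INST] Can you name at least two different techniques to cook lamb? [/INST]"
answer = complete(history)

# Every later turn re-sends the full history plus the new user message.
history += f" {answer} </s><s>[INST] Tell me more about the second method. [/INST]"
print(complete(history))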

The longer the conversation gets, the more time the model needs to generate a response. The conversation is limited by the model's context size, and larger models also usually take longer to respond.


Streaming

You can also perform a streaming request by passing "stream": true:

curl "https://api.deepinfra.com/v1/openai/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
     "model": "mistralai/Mixtral-8x22B-v0.1",
     "prompt": "<s>[INST] Hello! [/INST]",
     "stop": [
       "</s>"
     ],
     "stream": true
   }'

to which you'd get a sequence of SSE events, finishing with [DONE].

data: {"id": "cmpl-158cbf94cef043c2955172e8062ded3d", "object": "text_completion", "created": 1694623354, "model": "mistralai/Mixtral-8x22B-v0.1", "choices": [{"index": 0, "text": "Hi", "finish_reason": null}]}

data: {"id": "cmpl-158cbf94cef043c2955172e8062ded3d", "object": "text_completion", "created": 1694623354, "model": "mistralai/Mixtral-8x22B-v0.1", "choices": [{"index": 0, "text": "!", "finish_reason": null}]}

data: {"id": "cmpl-158cbf94cef043c2955172e8062ded3d", "object": "text_completion", "created": 1694623354, "model": "mistralai/Mixtral-8x22B-v0.1", "choices": [{"index": 0, "text": "", "finish_reason": null}]}

data: {"id": "cmpl-158cbf94cef043c2955172e8062ded3d", "object": "text_completion", "created": 1694623354, "model": "mistralai/Mixtral-8x22B-v0.1", "choices": [{"index": 0, "text": "", "finish_reason": "stop"}]}

data: [DONE]
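
A minimal way to consume that stream from Python, using the requests library (a sketch that assumes events arrive as data: lines exactly as shown above):

import json, os, requests

resp = requests.post(
    "https://api.deepinfra.com/v1/openai/completions",
    headers={"Authorization": f"Bearer {os.environ['DEEPINFRA_TOKEN']}"},
    json={
        "model": "mistralai/Mixtral-8x22B-v0.1",
        "prompt": "<s>[INST] Hello! [/INST]",
        "stop": ["</s>"],
        "stream": True,
    },
    stream=True,  # keep the HTTP connection open and read events as they arrive
)

for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":        # the stream always ends with this sentinel
        break
    event = json.loads(payload)
    print(event["choices"][0]["text"], end="", flush=True)
print()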

Input format

The basic format of the input is shown below. Bear in mind that newlines often matter.

<s>[INST] Hello! [/INST]

Conversation prompts contain the history of the exchanged prompts and responses.

<s>[INST] First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]

If you want to add a system prompt, it is done like this:

<s>[INST] <<SYS>>
System prompt
<</SYS>>

First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
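
A small helper that assembles such strings from a list of turns might look like the sketch below (illustrative only; the function name and turn structure are not part of the API):

def build_prompt(turns, system=None):
    # Build a Mixtral prompt from (user, assistant) pairs.
    # The assistant part of the final turn should be None: that is the
    # message the model is being asked to complete.
    prompt = ""
    for i, (user, assistant) in enumerate(turns):
        if i == 0 and system is not None:
            user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
        prompt += f"<s>[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant} </s>"
    return prompt

print(build_prompt([
    ("First question", "First answer"),
    ("Second question", "Second answer"),
    ("Final question", None),
]))
# <s>[INST] First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]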

Input fields

model (string)

model name


messages (array)

conversation messages: (user, assistant, tool)*, user, including one system message anywhere


stream (boolean)

whether to stream the output via SSE or return the full response

Default value: false


temperature (number)

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic

Default value: 1

Range: 0 ≤ temperature ≤ 2


top_p (number)

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

Default value: 1

Range: 0 < top_p ≤ 1


min_p (number)

Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

Default value: 0

Range: 0 ≤ min_p ≤ 1


top_k (integer)

Sample only from the k most likely tokens; 0 means disabled

Default value: 0

Range: 0 ≤ top_k < 1000


max_tokens (integer)

The maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length.

Range: 0 ≤ max_tokens ≤ 1000000


stop (string)

up to 16 sequences where the API will stop generating further tokens


n (integer)

number of sequences to return

Default value: 1

Range: 1 ≤ n ≤ 4


presence_penalty (number)

Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

Default value: 0

Range: -2 ≤ presence_penalty ≤ 2


frequency_penalty (number)

Positive values penalize new tokens based on how many times they appear in the text so far, increasing the model's likelihood to talk about new topics.

Default value: 0

Range: -2 ≤ frequency_penalty ≤ 2


tools (array)

A list of tools the model may call. Currently, only functions are supported as a tool.


tool_choice (string)

Controls which (if any) function is called by the model. none means the model will not call a function and instead generates a message. auto means the model can pick between generating a message or calling a function. Specifying a particular function choice is currently not supported. none is the default when no functions are present; auto is the default if functions are present.


response_format (object)

The format of the response. Currently, only json is supported.


repetition_penalty (number)

Alternative penalty for repetition, but multiplicative instead of additive (> 1 penalize, < 1 encourage)

Default value: 1

Range: 0.01 ≤ repetition_penalty ≤ 5


user (string)

A unique identifier representing your end-user, which can help monitor and detect abuse. Avoid sending us any identifying information. We recommend hashing user identifiers.


seed (integer)

Seed for random number generator. If not provided, a random seed is used. Determinism is not guaranteed.

Range: 0 ≤ seed < 9223372036854776000
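
As an illustration, a request that combines several of the fields above might look like this (the parameter values are arbitrary examples):

import os, requests

resp = requests.post(
    "https://api.deepinfra.com/v1/openai/completions",
    headers={"Authorization": f"Bearer {os.environ['DEEPINFRA_TOKEN']}"},
    json={
        "model": "mistralai/Mixtral-8x22B-v0.1",
        "prompt": "<s>[INST] Hello! [/INST]",
        "stop": ["</s>"],
        "temperature": 0.7,    # lower = more focused, higher = more random
        "top_p": 0.9,          # nucleus sampling over the top 90% probability mass
        "max_tokens": 256,     # cap on generated tokens
        "seed": 42,            # best-effort reproducibility
    },
)
print(resp.json()["choices"][0]["text"])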
