mistralai/Mixtral-8x7B-Instruct-v0.1

Mixtral is a mixture-of-experts (MoE) large language model (LLM) from Mistral AI. It is a state-of-the-art model built from 8 expert 7B models. During inference, 2 experts are selected for each token. This architecture allows large models to be fast and cheap at inference. Mixtral-8x7B outperforms Llama 2 70B on most benchmarks.

Public
$0.24 / Mtoken
bfloat16
32k
JSON
License

OpenAI-compatible HTTP API

You can POST to our OpenAI-compatible endpoint:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'

The response will look something like this:

{
    "id": "chatcmpl-guMTxWgpFf",
    "object": "chat.completion",
    "created": 1694623155,
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 15,
        "completion_tokens": 16,
        "total_tokens": 31
    }
}
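
For readers calling the endpoint from Python rather than curl, the request above could be sketched with the standard library alone. The helper names `build_chat_request`, `first_reply`, and `send_chat` are hypothetical and not part of any SDK. The response shape is assumed to match the sample above.

```python
import json
import urllib.request

API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"


def build_chat_request(prompt, token, **params):
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    payload.update(params)  # optional fields such as temperature or max_tokens
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def first_reply(response_body):
    """Return the assistant text of the first choice in a parsed response."""
    return response_body["choices"][0]["message"]["content"]


def send_chat(prompt, token, **params):
    """Send the request and return the first reply (needs network access)."""
    req = build_chat_request(prompt, token, **params)
    with urllib.request.urlopen(req) as resp:
        return first_reply(json.load(resp))
```

A call such as `send_chat("Hello!", os.environ["DEEPINFRA_TOKEN"])` would then return the assistant message text, assuming the token is set in the environment.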

You can also perform a streaming request by passing "stream": true:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
      "stream": true,
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'

The response is a sequence of SSE events, finishing with [DONE]:

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " "}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " Hi"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "!"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "</s>"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

data: [DONE]
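
A stream like the one above can be reassembled into the full message by concatenating each chunk's `delta.content`. The following is a minimal sketch, assuming the events arrive as text lines prefixed with `data:`. The function name `collect_stream_text` is hypothetical, and the sample lines are abbreviated versions of the chunks shown above.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta content from 'data: ...' SSE lines until [DONE]."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # The final chunk carries an empty delta and a finish_reason.
            parts.append(choice.get("delta", {}).get("content", ""))
    return "".join(parts)


# Abbreviated sample events (ids and timestamps omitted for brevity).
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": " "}, "finish_reason": null}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": " Hi"}, "finish_reason": null}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": "!"}, "finish_reason": null}]}',
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}',
    "data: [DONE]",
]
text = collect_stream_text(sample)
```

Note that the sample stream above also emits an end-of-sequence marker (`</s>`) as content, so a client may want to filter such tokens before displaying the text.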

Currently supported parameters:

  • temperature - controls how random the generation is
  • top_p - nucleus sampling threshold for token selection
  • max_tokens - maximum number of generated tokens
  • stop - up to 4 strings that terminate generation early
  • n - number of sequences to generate (up to 2)

Known caveats:

  • If generation is terminated by a stop sequence, the stop sequence is present in the output (in OpenAI's API it is not).
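
Clients that expect OpenAI's behavior can trim the stop sequence themselves. This is a minimal sketch, assuming the stop sequence appears at the very end of the returned text. The helper name `strip_stop_sequence` is hypothetical.

```python
def strip_stop_sequence(text, stop):
    """Remove a trailing stop sequence, if the text ends with one."""
    for sequence in stop:
        # Skip empty strings, which would otherwise match every text.
        if sequence and text.endswith(sequence):
            return text[: -len(sequence)]
    return text


trimmed = strip_stop_sequence("Answer: 42\n###", ["###", "END"])
```

Text that does not end with any of the given stop sequences is returned unchanged.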

Input fields

model (string)

model name


messages (array)

conversation messages in the order (user, assistant, tool)*, user, with one system message allowed anywhere


stream (boolean)

whether to stream the output via SSE or return the full response

Default value: false


temperature (number)

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

Default value: 1

Range: 0 ≤ temperature ≤ 2
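
To illustrate why higher temperatures give more random output, here is a sketch of the usual temperature-scaled softmax. How the service implements sampling internally is not stated on this page, so treating temperature 0 as greedy (argmax) decoding is an assumption, and `softmax_with_temperature` is a hypothetical name.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities, scaled by a sampling temperature."""
    if temperature == 0:
        # Assumption: temperature 0 is treated as greedy (argmax) decoding.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
focused = softmax_with_temperature(logits, 0.2)     # mass concentrates on the top token
random_ish = softmax_with_temperature(logits, 0.8)  # mass is spread more evenly
```

Dividing the logits by a smaller temperature widens the gaps between them, so the most likely token receives a larger share of the probability mass.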


top_p (number)

Default value: 1

Range: 0 < top_p ≤ 1


max_tokens (integer)

The maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length. If not set or None, it defaults to the model's maximum context length minus the input length.

Default value: 512

Range: 0 ≤ max_tokens ≤ 100000


stop (string)

up to 16 sequences where the API will stop generating further tokens


n (integer)

number of sequences to return. n != 1 is incompatible with streaming

Default value: 1

Range: 1 ≤ n ≤ 2


presence_penalty (number)

Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

Default value: 0

Range: -2 ≤ presence_penalty ≤ 2


frequency_penalty (number)

Positive values penalize new tokens based on how many times they appear in the text so far, decreasing the model's likelihood of repeating the same line verbatim.

Default value: 0

Range: -2 ≤ frequency_penalty ≤ 2
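
The ranges documented so far can be checked on the client before a request is sent. This is a minimal sketch that uses only the ranges listed above. The function name `validate_params` is hypothetical, and the server remains the authority on what it accepts.

```python
def validate_params(params):
    """Return a list of problems found in a request-parameter dict."""
    # Inclusive ranges copied from the field descriptions above.
    ranges = {
        "temperature": (0, 2),
        "max_tokens": (0, 100000),
        "n": (1, 2),
        "presence_penalty": (-2, 2),
        "frequency_penalty": (-2, 2),
    }
    problems = []
    for name, (low, high) in ranges.items():
        if name in params and not low <= params[name] <= high:
            problems.append(f"{name} must be between {low} and {high}")
    # top_p has an exclusive lower bound: 0 < top_p <= 1.
    if "top_p" in params and not 0 < params["top_p"] <= 1:
        problems.append("top_p must be greater than 0 and at most 1")
    # Multiple sequences cannot be combined with streaming.
    if params.get("stream") and params.get("n", 1) != 1:
        problems.append("n != 1 is incompatible with streaming")
    return problems


issues = validate_params({"temperature": 2.5, "top_p": 0, "n": 2, "stream": True})
```

An empty list means that none of the documented constraints were violated.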


tools (array)

A list of tools the model may call. Currently, only functions are supported as a tool.


tool_choice (string)

Controls which (if any) function is called by the model. none means the model will not call a function and instead generates a message. auto means the model can pick between generating a message or calling a function. Specifying a particular function choice is not currently supported. none is the default when no functions are present. auto is the default if functions are present.
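
As an illustration, a request body using `tools` and `tool_choice` might look like the sketch below. The tool layout follows the OpenAI function-calling convention, which this endpoint is assumed to mirror. The function `get_weather` and its parameters are hypothetical.

```python
import json

payload = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [
        {
            "type": "function",  # only functions are supported as a tool
            "function": {
                "name": "get_weather",  # hypothetical function name
                "description": "Look up the current weather for a city.",
                "parameters": {  # JSON Schema describing the arguments
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",  # let the model decide whether to call the function
}

body = json.dumps(payload)  # string to send as the POST body
```

Setting `tool_choice` to `none` instead would ask the model to answer with a plain message.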


response_format (object)

The format of the response. Currently, only json is supported.


repetition_penalty (number)

Alternative penalty for repetition, but multiplicative instead of additive (> 1 penalize, < 1 encourage)

Default value: 1
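
The difference between the additive penalties and the multiplicative one can be sketched on raw logits. The additive form follows OpenAI's published description of presence and frequency penalties. The multiplicative form follows a common open-source convention (divide positive logits, multiply negative ones). Whether this service uses exactly these formulas is an assumption, and both function names are hypothetical.

```python
def apply_additive_penalties(logits, counts, presence_penalty=0.0, frequency_penalty=0.0):
    """Subtract penalties from each token's logit based on its prior count."""
    penalized = []
    for logit, count in zip(logits, counts):
        seen = 1.0 if count > 0 else 0.0
        penalized.append(logit - frequency_penalty * count - presence_penalty * seen)
    return penalized


def apply_repetition_penalty(logits, counts, repetition_penalty=1.0):
    """Scale the logits of already-seen tokens (> 1 penalizes, < 1 encourages)."""
    penalized = []
    for logit, count in zip(logits, counts):
        if count > 0:
            # Dividing a positive logit or multiplying a negative one both
            # lower the token's score when repetition_penalty > 1.
            logit = logit / repetition_penalty if logit > 0 else logit * repetition_penalty
        penalized.append(logit)
    return penalized


logits = [2.0, 1.0, -1.0]
counts = [3, 0, 1]  # how often each token has appeared so far
additive = apply_additive_penalties(logits, counts, presence_penalty=0.5, frequency_penalty=0.1)
multiplicative = apply_repetition_penalty(logits, counts, repetition_penalty=2.0)
```

In both variants, tokens that have not appeared yet (count 0) keep their original logit.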
