Mixtral is a mixture-of-experts (MoE) large language model (LLM) from Mistral AI. It is a state-of-the-art model that combines 8 expert 7B models. During inference, 2 experts are selected for each token. This architecture allows large models to be fast and cheap at inference. Mixtral-8x7B outperforms Llama 2 70B on most benchmarks.
You can POST to our OpenAI-compatible endpoint:
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
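For readers calling the endpoint from code rather than the shell, the same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the helper name `build_chat_request` and the placeholder token are assumptions for illustration, and the request is built but not sent.

```python
import json
import urllib.request

API_URL = "https://api.deepinfra.com/v1/openai/chat/completions"
MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"


def build_chat_request(token, messages, **params):
    """Build (but do not send) a POST request equivalent to the curl call.

    `build_chat_request` is a hypothetical helper, not part of any SDK.
    Extra keyword arguments (temperature, max_tokens, ...) are merged
    into the JSON body unchanged.
    """
    payload = {"model": MODEL, "messages": messages, **params}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# "YOUR_TOKEN" is a placeholder; substitute the value of $DEEPINFRA_TOKEN.
request = build_chat_request("YOUR_TOKEN", [{"role": "user", "content": "Hello!"}])

# Sending requires a valid token and network access:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
```

Keeping construction separate from sending makes the payload easy to inspect or log before any network call is made.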
To which you'd get something like:
{
  "id": "chatcmpl-guMTxWgpFf",
  "object": "chat.completion",
  "created": 1694623155,
  "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 16,
    "total_tokens": 31
  }
}
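A response of this shape can be unpacked as in the following sketch. The helper name `extract_reply` is an assumption for illustration, and the sample dictionary is an abbreviated copy of the response shown above.

```python
def extract_reply(response):
    """Return (text, finish_reason) for the first choice of a chat completion.

    `extract_reply` is a hypothetical helper. The text is stripped because
    the sample reply begins with a leading space.
    """
    choice = response["choices"][0]
    return choice["message"]["content"].strip(), choice["finish_reason"]


# Abbreviated copy of the sample response (content shortened).
sample_response = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": " Hello! It's nice to meet you."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 15, "completion_tokens": 16, "total_tokens": 31},
}

text, finish_reason = extract_reply(sample_response)
usage = sample_response["usage"]
```

In the sample, `total_tokens` is the sum of `prompt_tokens` and `completion_tokens`, which is a convenient sanity check when tracking usage.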
You can also perform a streaming request by passing "stream": true:
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "stream": true,
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
To which you'd get a sequence of SSE events, finishing with [DONE]:
data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " "}, "finish_reason": null}]}
data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " Hi"}, "finish_reason": null}]}
data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "!"}, "finish_reason": null}]}
data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}
data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "</s>"}, "finish_reason": null}]}
data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}
data: [DONE]
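A stream like this can be reassembled by reading each `data:` line, decoding its JSON, and concatenating the `delta.content` fragments until `[DONE]` arrives. The sketch below assumes the lines have already been read from the HTTP response. The helper names `collect_stream_text` and `event` are hypothetical, and the sample events keep only the fields the parser uses.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta.content from SSE lines, stopping at [DONE]."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and anything that is not a data line
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))  # the final chunk has an empty delta
    return "".join(parts)


def event(delta, finish_reason=None):
    """Build one abbreviated SSE line for the demonstration below."""
    body = {"choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}]}
    return "data: " + json.dumps(body)


# Same content fragments as the sample stream above.
sample_lines = [
    event({"role": "assistant", "content": " "}),
    event({"role": "assistant", "content": " Hi"}),
    event({"role": "assistant", "content": "!"}),
    event({"role": "assistant", "content": ""}),
    event({"role": "assistant", "content": "</s>"}),
    event({}, finish_reason="stop"),
    "data: [DONE]",
]

streamed_text = collect_stream_text(sample_lines)
```

Note that the sample stream delivers the end-of-sequence marker `</s>` as ordinary content, so a caller may want to strip it before displaying the text.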
Currently supported parameters:
temperature - more or less random generation
top_p - controls token sampling
max_tokens - maximum number of generated tokens
stop - up to 4 strings to terminate generation earlier
n - number of sequences to generate (up to 2)
Known caveats:
messages (array)
Conversation messages: (user,assistant,tool)*,user including one system message anywhere
temperature (number)
What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
Default value: 1
Range: 0 ≤ temperature ≤ 2
max_tokens (integer)
The maximum number of tokens to generate in the chat completion. The total length of input tokens and generated tokens is limited by the model's context length. If not set or None, defaults to the model's max context length minus input length.
Default value: 512
Range: 0 ≤ max_tokens ≤ 100000
n (integer)
Number of sequences to return. n != 1 is incompatible with streaming.
Default value: 1
Range: 1 ≤ n ≤ 2
presence_penalty (number)
Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
Default value: 0
Range: -2 ≤ presence_penalty ≤ 2
frequency_penalty (number)
Positive values penalize new tokens based on how many times they appear in the text so far, decreasing the model's likelihood to repeat the same text.
Default value: 0
Range: -2 ≤ frequency_penalty ≤ 2
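To make the difference between the two additive penalties concrete, here is a sketch of one common formulation, the one described in OpenAI's documentation for these parameters: a token's logit is reduced by `presence_penalty` once if the token has appeared at all, and by `frequency_penalty` for every occurrence. Whether this endpoint applies exactly this formula is an assumption. The function name and the toy logits are illustrative only.

```python
from collections import Counter


def apply_additive_penalties(logits, generated_tokens,
                             presence_penalty=0.0, frequency_penalty=0.0):
    """Return a copy of `logits` with additive penalties applied.

    Assumed formulation (not confirmed for this endpoint):
        logit[t] -= presence_penalty            (if t has appeared at all)
        logit[t] -= frequency_penalty * count   (count = occurrences of t)
    """
    counts = Counter(generated_tokens)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token in adjusted:
            adjusted[token] -= presence_penalty + frequency_penalty * count
    return adjusted


# Toy vocabulary of three tokens; "a" was generated twice, "b" once.
toy_logits = {"a": 2.0, "b": 1.0, "c": 0.5}
penalized = apply_additive_penalties(
    toy_logits, ["a", "a", "b"], presence_penalty=0.5, frequency_penalty=0.25
)
```

Under this formulation the presence penalty is a flat, one-time cost, while the frequency penalty grows with each repetition, so "a" is penalized more than "b" and the unseen "c" is untouched.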
tool_choice (string)
Controls which (if any) function is called by the model. none means the model will not call a function and instead generates a message. auto means the model can pick between generating a message or calling a function. Specifying a particular function choice is not currently supported. none is the default when no functions are present; auto is the default if functions are present.
repetition_penalty (number)
Alternative penalty for repetition, multiplicative instead of additive (> 1 penalizes, < 1 encourages).
Default value: 1
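The documented ranges can be checked on the client before a request is sent, which avoids a round trip for obviously invalid input. The sketch below copies the ranges from the reference above; the helper name `validate_params` and the wording of its messages are assumptions, and the server remains the authority on what it accepts.

```python
# Ranges copied from the parameter reference above.
RANGES = {
    "temperature": (0, 2),
    "max_tokens": (0, 100000),
    "n": (1, 2),
    "presence_penalty": (-2, 2),
    "frequency_penalty": (-2, 2),
}
MAX_STOP_STRINGS = 4  # "up to 4 strings to terminate generation earlier"


def validate_params(params):
    """Return a list of problems found in `params`; an empty list means OK."""
    problems = []
    for name, (low, high) in RANGES.items():
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}={value} is outside [{low}, {high}]")
    # The reference states that n != 1 is incompatible with streaming.
    if params.get("stream") and params.get("n", 1) != 1:
        problems.append("n != 1 is incompatible with streaming")
    stop = params.get("stop")
    if isinstance(stop, (list, tuple)) and len(stop) > MAX_STOP_STRINGS:
        problems.append(f"stop accepts at most {MAX_STOP_STRINGS} strings")
    return problems
```

For example, `validate_params({"n": 2, "stream": True})` reports the streaming conflict even though `n` itself is within range.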