MistralLite is a fine-tuned Mistral-7B-v0.1 language model with enhanced capabilities for processing long contexts (up to 32K tokens). By using an adapted Rotary Embedding and a sliding window during fine-tuning, MistralLite performs significantly better on several long-context retrieval and question-answering tasks, while keeping the simple model structure of the original model.

To query this model you need to provide a properly formatted input string.

```bash
curl "https://api.deepinfra.com/v1/inference/amazon/MistralLite" \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer $(deepctl auth token)" \
   -d '{
     "input": "[INST] Just say hi! [/INST] "
   }'
```

That will respond with:

```json
{
    "request_id": "RWZDRhS5kdoM1XWwXLEshynO",
    "inference_status": {
        "status": "succeeded",
        "runtime_ms": 243,
        "cost": 0.0,
        "tokens_generated": 3
    },
    "results": [
        {
            "generated_text": "Hi!"
        }
    ],
    "num_tokens": 3
}
```
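The same request can also be prepared from Python. The following is a minimal sketch using only the standard library, not an official client: the helper names `build_request` and `extract_text` are hypothetical, and the token is a placeholder you supply. It builds the request without sending it and reads `generated_text` from a response shaped like the one above.

```python
import json
import urllib.request

API_URL = "https://api.deepinfra.com/v1/inference/amazon/MistralLite"


def build_request(prompt: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the inference endpoint."""
    body = json.dumps({"input": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def extract_text(response_body: str) -> str:
    """Return the first generated text from a JSON response body."""
    payload = json.loads(response_body)
    return payload["results"][0]["generated_text"]


# To actually send it (requires a valid token and network access):
#   with urllib.request.urlopen(build_request(prompt, token)) as resp:
#       print(extract_text(resp.read().decode("utf-8")))
```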

To do a streaming request, just pass `"stream": true`:

```bash
curl "https://api.deepinfra.com/v1/inference/amazon/MistralLite" \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer $(deepctl auth token)" \
   -d '{
     "input": "[INST] Just say hi! [/INST] ",
     "stream": true
   }'
```

which outputs:

```
data: {"token": {"id": 6324, "text": " Hi", "logprob": 0.0, "special": false}, "generated_text": null, "details": null}

data: {"token": {"id": 29991, "text": "!", "logprob": 0.0, "special": false}, "generated_text": null, "details": null}

data: {"token": {"id": 2, "text": "</s>", "logprob": -0.22229004, "special": true}, "generated_text": "Hi!", "details": {"finish_reason": "eos_token", "generated_tokens": 3, "input_tokens": 13, "seed": 16848278268029293276}}
```
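A client has to strip the `data:` prefix from each event and decode the JSON that follows. The sketch below is one possible way to do that, assuming the event shape shown above; `iter_token_texts` is a hypothetical helper, and the sample lines are abbreviated copies of the events above.

```python
import json
from typing import Iterable, Iterator


def iter_token_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each non-special token from SSE `data:` lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        event = json.loads(line[len("data:"):])
        token = event["token"]
        if not token["special"]:  # drop markers such as </s>
            yield token["text"]


# Abbreviated sample events (the `details` field is shortened).
sample = [
    'data: {"token": {"id": 6324, "text": " Hi", "logprob": 0.0, "special": false}, "generated_text": null, "details": null}',
    "",
    'data: {"token": {"id": 29991, "text": "!", "logprob": 0.0, "special": false}, "generated_text": null, "details": null}',
    "",
    'data: {"token": {"id": 2, "text": "</s>", "logprob": -0.22229004, "special": true}, "generated_text": "Hi!", "details": {"finish_reason": "eos_token"}}',
]

print("".join(iter_token_texts(sample)))
```

The last event also carries the full `generated_text`, so a client that does not need incremental output can read that field instead.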


The basic format of the input is:

```
[INST] first question [/INST] first answer</s><s>
[INST] second question [/INST] second answer</s><s>
[INST] final question [/INST]
```
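A prompt in this format can be assembled from a conversation history. The sketch below is illustrative only: `build_prompt` is a hypothetical helper, and it assumes that turns are separated by a single newline, mirroring the layout shown above literally.

```python
from typing import List, Tuple


def build_prompt(history: List[Tuple[str, str]], question: str) -> str:
    """Format past (question, answer) turns followed by a new question."""
    lines = [f"[INST] {q} [/INST] {a}</s><s>" for q, a in history]
    lines.append(f"[INST] {question} [/INST]")
    return "\n".join(lines)


history = [("first question", "first answer")]
print(build_prompt(history, "final question"))
```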

If you want to add a system prompt, modify the first question (newlines matter):

```
[INST] <<SYS>>
your system prompt goes here
<</SYS>>

first question [/INST] ...
```
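Because the newlines are significant, it can help to build the first turn programmatically. This is a small sketch with a hypothetical helper name, `with_system_prompt`; it assumes the Llama 2 chat convention, in which the system block is closed with `<</SYS>>` and followed by a blank line.

```python
def with_system_prompt(system_prompt: str, first_question: str) -> str:
    """Wrap a system prompt and the first question into one [INST] turn."""
    # Assumption: Llama 2 style delimiters, with <</SYS>> as the closing tag.
    return (
        "[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n"
        "\n"
        f"{first_question} [/INST]"
    )


print(with_system_prompt("You are a concise assistant.", "first question"))
```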

For airoboros the prompt can be:

```
A chat.
USER: question
ASSISTANT:
```

To include history, add an extra newline between prompts. Check the [airoboros prompt format](https://huggingface.co/jondurbin/airoboros-l2-7b-2.2/blob/main/README.md#prompt-format) for more info.
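As a rough sketch of that layout, the hypothetical helper below joins the preamble, earlier turns, and the new question with newlines. How history is laid out here (each earlier turn as a `USER:` line followed by an `ASSISTANT:` line) is an assumption; the linked README is the authoritative description.

```python
from typing import List, Tuple


def build_airoboros_prompt(
    history: List[Tuple[str, str]], question: str, preamble: str = "A chat."
) -> str:
    """Build an airoboros-style prompt ending with an open ASSISTANT: line."""
    lines = [preamble]
    for past_question, past_answer in history:
        lines.append(f"USER: {past_question}")
        lines.append(f"ASSISTANT: {past_answer}")
    lines.append(f"USER: {question}")
    lines.append("ASSISTANT:")
    return "\n".join(lines)


print(build_airoboros_prompt([], "question"))
```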


You can use our command-line tool [deepctl](/docs/getting-started) to run
inferences:

```bash
deepctl infer \
    -m 'amazon/MistralLite'  \
    -i 'input=I have this dream'
```

which will give you back something similar to:

```json
{
  "results": [
    {
      "generated_text": "I have this dream about the day I got a job at a tech company. I just woke up on a plane. I sat down on the floor and started getting work done. After getting up around 6 p.m., I looked around and"
    }
  ],
  "num_tokens": 42,
  "num_input_tokens": 100,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}
```


You can POST to our OpenAI-compatible endpoint:

```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
      "model": "amazon/MistralLite",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'
```

To which you'd get something like:

```json
{
    "id": "chatcmpl-guMTxWgpFf",
    "object": "chat.completion",
    "created": 1694623155,
    "model": "amazon/MistralLite",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 15,
        "completion_tokens": 16,
        "total_tokens": 31
    }
}
```
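To continue the conversation, the assistant's reply can be appended to the message list before the next request. The sketch below assumes the endpoint accepts `assistant` messages as history, as the OpenAI chat format does; `next_turn_messages` is a hypothetical helper, and the response dictionary is a trimmed stand-in for the one above.

```python
from typing import Dict, List

Message = Dict[str, str]


def next_turn_messages(
    messages: List[Message], response: dict, follow_up: str
) -> List[Message]:
    """Return old turns, plus the assistant reply, plus a new user turn."""
    reply = response["choices"][0]["message"]
    return messages + [
        {"role": reply["role"], "content": reply["content"]},
        {"role": "user", "content": follow_up},
    ]


messages = [{"role": "user", "content": "Hello!"}]
response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": " Hello!"}}
    ]
}
print(next_turn_messages(messages, response, "Tell me a joke."))
```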

You can also perform a streaming request by passing `"stream": true`:

```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
      "model": "amazon/MistralLite",
      "stream": true,
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'
```

The response is a sequence of [SSE](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events) events, ending with `[DONE]`:

```
data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "amazon/MistralLite", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " "}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "amazon/MistralLite", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " Hi"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "amazon/MistralLite", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "!"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "amazon/MistralLite", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "amazon/MistralLite", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "</s>"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "amazon/MistralLite", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

data: [DONE]
```

Currently supported parameters:
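- `temperature` - controls randomness: 0 is deterministic, higher values give more varied output
- `top_p` - nucleus sampling threshold: sample only from the most probable tokens whose cumulative probability exceeds this value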
- `max_tokens` - maximum number of generated tokens
- `stop` - up to 4 strings to terminate generation earlier
- `n` - number of sequences to generate (up to 2)

Known caveats:
- if the generation is terminated due to a stop sequence, the stop sequence is
  present in the output (but in OpenAI it is not).
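If you need OpenAI-like behavior, the stop sequence can be trimmed on the client side. This is a minimal sketch with a hypothetical helper, `strip_stop_sequence`; it assumes the stop sequence, when present, is the very end of the returned text.

```python
from typing import Sequence


def strip_stop_sequence(text: str, stop: Sequence[str]) -> str:
    """Remove a trailing stop sequence, if the text ends with one."""
    for sequence in stop:
        if sequence and text.endswith(sequence):
            return text[: -len(sequence)]
    return text


print(repr(strip_stop_sequence("Hello there.\nUSER:", ["USER:"])))
```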


You can use the official OpenAI Python client to run inferences with us:

```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="<YOUR DEEPINFRA TOKEN: deepctl auth token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="amazon/MistralLite",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
```

You can also use the streaming option:

```python
chat_completion = openai.chat.completions.create(
    model="amazon/MistralLite",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

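for event in chat_completion:
    # delta.content can be None (e.g. on the final chunk), so fall back to ""
    print(event.choices[0].delta.content or "", end="", flush=True)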
```


You can use JavaScript in the browser or node.js to make requests with us:


```javascript
// for node.js before v18 (which has no built-in fetch), you can use the node-fetch package
// import fetch from 'node-fetch'

const API_KEY = "<YOUR DEEPINFRA TOKEN>";

const response = await fetch('https://api.deepinfra.com/v1/openai/chat/completions', {
    method: 'POST',
    body: JSON.stringify({
        model: "amazon/MistralLite",
        messages: [{role: "user", content: "Hello"}],
        max_tokens: 20,
    }),
    headers: {
        "Content-Type": "application/json",
        authorization: `Bearer ${API_KEY}`,
    }
});
const data = await response.json();

console.log(data.choices[0].message.content);
console.log(data.usage.prompt_tokens, data.usage.completion_tokens);
```


Input fields of the inference endpoint:
- `input` - the input prompt string (formatted as described above)
- `max_new_tokens` - maximum length of the newly generated text. If not set or None, defaults to the model's max context length minus the input length.
- `temperature` - temperature to use for sampling. 0 means the output is deterministic. Values greater than 1 encourage more diversity.
- `top_p` - sample from the set of tokens with the highest probability such that the sum of their probabilities is higher than p. Lower values focus on the most probable tokens. Higher values sample more low-probability tokens.
- `top_k` - sample from the best k (number of) tokens. 0 means off.
- `repetition_penalty` - a value of 1 means no penalty; values greater than 1 discourage repetition, values smaller than 1 encourage it.
- `stop` - up to 4 strings that will terminate generation immediately
- `num_responses` - number of output sequences to return. Incompatible with streaming.
- `presence_penalty` - positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics
- `frequency_penalty` - positive values penalize new tokens based on how many times they appear in the text so far, decreasing the model's likelihood to repeat the same text verbatim
- `webhook` - the webhook to call when inference is done. By default, you get the output in the response of your inference request.
- `stream` - whether to stream tokens (false by default). Token-by-token updates are sent over SSE. Currently only supported for Llama 2 text generation models.
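
Two of the constraints in these descriptions (at most 4 stop strings, and multiple responses being incompatible with streaming) can be checked before a request is sent. The sketch below is one possible client-side check; `build_payload` is a hypothetical helper, and treating `num_responses` of 1 as compatible with streaming is an assumption.

```python
from typing import Any, Dict, List, Optional


def build_payload(
    prompt: str,
    *,
    max_new_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    stop: Optional[List[str]] = None,
    num_responses: Optional[int] = None,
    stream: bool = False,
) -> Dict[str, Any]:
    """Build a JSON-serializable request body, omitting unset fields."""
    if stop is not None and len(stop) > 4:
        raise ValueError("stop accepts at most 4 strings")
    if stream and num_responses is not None and num_responses > 1:
        raise ValueError("num_responses > 1 is incompatible with streaming")
    optional = {
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stop": stop,
        "num_responses": num_responses,
    }
    payload: Dict[str, Any] = {"input": prompt}
    payload.update({k: v for k, v in optional.items() if v is not None})
    if stream:
        payload["stream"] = True
    return payload


print(build_payload("[INST] Just say hi! [/INST] ", max_new_tokens=16, stop=["</s>"]))
```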
