Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

whisper-large-v3

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B and 405B sizes

Meta-Llama-3.1-405B-Instruct

Meta-Llama-3.1-70B-Instruct

Meta-Llama-3.1-8B-Instruct

Gemma is a family of lightweight, state-of-the-art open models from Google. Gemma-2-27B delivers the best performance for its size class, and even offers competitive alternatives to models more than twice its size. 

gemma-2-27b-it

Gemma is a family of lightweight, state-of-the-art open models from Google. The 9B Gemma 2 model delivers class-leading performance, outperforming Llama 3 8B and other open models in its size category.

gemma-2-9b-it

Dolphin 2.9.1, a fine-tuned Llama-3-70b model. The new model, trained on filtered data, is more compliant but uncensored. It demonstrates improvements in instruction, conversation, coding, and function calling abilities.

dolphin-2.9.1-llama-3-70b

Euryale 70B v2.1 is a model focused on creative roleplay from Sao10k

L3-70B-Euryale-v2.1

Model Details Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8 and 70B sizes.

Meta-Llama-3-70B-Instruct

The 72 billion parameter Qwen2 excels in language understanding, multilingual capabilities, coding, mathematics, and reasoning.

Qwen2-72B-Instruct

The Phi-3-Medium-4K-Instruct is a powerful and lightweight language model with 14 billion parameters, trained on high-quality data to excel in instruction following and safety measures. It demonstrates exceptional performance across benchmarks, including common sense, language understanding, and logical reasoning, outperforming models of similar size.

Phi-3-medium-4k-instruct

Openchat 3.6 is a LLama-3-8b fine tune that outperforms it on multiple benchmarks.

openchat-3.6-8b

Mistral-7B-Instruct-v0.3 is an instruction-tuned model, next iteration of of Mistral 7B that has larger vocabulary, newer tokenizer and supports function calling.

Mistral-7B-Instruct-v0.3

Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8 and 70B sizes.

Meta-Llama-3-8B-Instruct

This is the instruction fine-tuned version of Mixtral-8x22B - the latest and largest mixture of experts large language model (LLM) from Mistral AI. This state of the art machine learning model uses a mixture 8 of experts (MoE) 22b models. During inference 2 experts are selected. This architecture allows large models to be fast and cheap at inference.

Mixtral-8x22B-Instruct-v0.1

WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to those leading proprietary models.

WizardLM-2-8x22B

WizardLM-2 7B is the smaller variant of Microsoft AI's latest Wizard model. It is the fastest and achieves comparable performance with existing 10x larger open-source leading models

WizardLM-2-7B

Mixtral is mixture of expert large language model (LLM) from Mistral AI. This is state of the art machine learning model using a mixture 8 of experts (MoE) 7b models. During inference 2 expers are selected. This architecture allows large models to be fast and cheap at inference. The Mixtral-8x7B outperforms Llama 2 70B on most benchmarks.

Mixtral-8x7B-Instruct-v0.1

A Mythomax/MLewd_13B-style merge of selected 70B models  A multi-model merge of several  LLaMA2 70B finetunes for roleplaying and creative work. The goal was to create a model that combines creativity with intelligence for an enhanced experience.

lzlv_70b_fp16_hf

LLaVa is a multimodal model that supports vision and language models combined.

llava-1.5-7b-hf

SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.

sdxl

BGE embedding is a general Embedding Model. It is pre-trained using retromae and trained on large-scale pair data using contrastive learning. Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned

bge-large-en-v1.5

Mixtral-8x22B is the latest and largest mixture of expert large language model (LLM) from Mistral AI. This is state of the art machine learning model using a mixture 8 of experts (MoE) 22b models. During inference 2 expers are selected. This architecture allows large models to be fast and cheap at inference.  This model is not instruction tuned. 

This is an advanced and more complex API. We strongly recommend that you use OpenAI Chat Completions instead.

#### Simple prompt

To query this model you need to provide a properly formatted input string.

```bash
curl "https://api.deepinfra.com/v1/inference/mistralai/Mixtral-8x22B-v0.1" \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
   -d '{
     "input": "<s>[INST] Hello! [/INST]",
     "stop": [
       "</s>"
     ]
   }'
```

That will respond with

```json
{
    "request_id": "RWZDRhS5kdoM1XWwXLEshynO",
    "inference_status": {
        "status": "succeeded",
        "runtime_ms": 243,
        "cost": 0.0000436,
        "tokens_input": 12,
        "tokens_generated": 25
    },
    "results": [
        {
            "generated_text": "Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
        }
    ],
    "num_tokens": 25,
    "num_input_tokens":12
}
```

#### Conversations

The OpenAI Chat Completions API is better suited for chat-like conversations, use it instead.

To query this model you need to provide a properly formatted input string.

However, you can still do it if you really need to. You have to add each response and each of the user prompts to every request.
You need a properly formatted input string to make it understand the current context. See the example below for some of them.
You can tweak it even further by providing a system message.

```bash
curl "https://api.deepinfra.com/v1/inference/mistralai/Mixtral-8x22B-v0.1" \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
   -d '{
     "input": "<s>[INST] <<SYS>>\nRespond like a michelin starred chef.\n<</SYS>>\n\nCan you name at least two different techniques to cook lamb? [/INST] Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'"'"'m more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\" </s><s>[INST] Tell me more about the second method. [/INST]",
     "stop": [
       "</s>"
     ]
   }'
```

The conversation above might return something like the following

```json
{
    "request_id": "RWZDRhS5kdoM1XWwXLEshynO",
    "inference_status": {
        "status": "succeeded",
        "runtime_ms": 243,
        "cost": 0.000436,
        "tokens_input": 149,
        "tokens_generated": 338
    },
    "results": [
        {
            "generated_text": "Sous le Sable, my friend! It's an ancient technique that's been used for centuries in the Middle East and North Africa. The name itself..."
        }
    ],
    "num_tokens": 338,
    "num_input_tokens": 149
}
```

The longer the conversation gets, the more time it takes the model to generate the response. The conversation is limited by the context size of a model. Larger models also usually take more time to respond.

<br>

### Streaming


To do a streaming request, just pass `"stream": true`:

```bash
curl "https://api.deepinfra.com/v1/inference/mistralai/Mixtral-8x22B-v0.1" \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
   -d '{
     "input": "<s>[INST] Hello! [/INST]",
     "stop": [
       "</s>"
     ],
     "stream": true
   }'
```

which outputs:

```json
data: {"token": {"id": null, "text": "Hello", "logprob": 0.0, "special": false}, "generated_text": "", "details": null, "estimated_cost": null}

data: {"token": {"id": null, "text": "!", "logprob": 0.0, "special": false}, "generated_text": "", "details": null, "estimated_cost": null}

data: {"token": {"id": null, "text": " It", "logprob": 0.0, "special": false}, "generated_text": "", "details": null, "estimated_cost": null}

data: {"token": {"id": null, "text": "'s", "logprob": 0.0, "special": false}, "generated_text": "", "details": null, "estimated_cost": null}

....

data: {"token": {"id": null, "text": "", "logprob": 0.0, "special": false}, "generated_text": null, "details": {"finish_reason": "stop"}, "num_output_tokens": 25, "num_input_tokens": 12, "estimated_cost": 0.0000386}
```

#### Input format

You can see below the basic format of the input. Bear in mind that newlines often matter.

```text
<s>[INST] Hello! [/INST]
```

Conversation prompts contain the history of the exchanged prompts and responses.

```text
<s>[INST] First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```

If you want to add system prompt, it is done like this

```text
<s>[INST] <<SYS>>
System prompt
<</SYS>>

First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```



You can use our command-line tool [deepctl](/docs/getting-started) to run inferences:

This is an advanced and more complex API. We strongly recommend that you use OpenAI Chat Completions instead.

#### Simple prompt

To query this model you need to provide a properly formatted input string.

```bash
deepctl infer \
    -m 'mistralai/Mixtral-8x22B-v0.1'  \
    -i 'input="<s>[INST] Hello! [/INST]"'  \
    -i 'stop=["</s>"]'
```

That will respond with

```json
{
    "inference_status": {
        "status": "succeeded",
        "runtime_ms": 243,
        "cost": 0.0000436,
        "tokens_input": 12,
        "tokens_generated": 25
    },
    "results": [
        {
            "generated_text": "Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
        }
    ],
    "num_tokens": 25,
    "num_input_tokens":12
}
```

#### Conversations

The OpenAI Chat Completions API is better suited for chat-like conversations, use it instead.

To query this model you need to provide a properly formatted input string.

However, you can still do it if you really need to. You have to add each response and each of the user prompts to every request.
You need a properly formatted input string to make it understand the current context. See the example below for some of them.
You can tweak it even further by providing a system message.

```bash
deepctl infer \
    -m 'mistralai/Mixtral-8x22B-v0.1'  \
    -i 'input="<s>[INST] <<SYS>>\nRespond like a michelin starred chef.\n<</SYS>>\n\nCan you name at least two different techniques to cook lamb? [/INST] Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'"'"'m more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\" </s><s>[INST] Tell me more about the second method. [/INST]"'  \
    -i 'stop=["</s>"]'
```

The conversation above might return something like the following

```json
{
    "inference_status": {
        "status": "succeeded",
        "runtime_ms": 243,
        "cost": 0.000436,
        "tokens_input": 149,
        "tokens_generated": 338
    },
    "results": [
        {
            "generated_text": "Sous le Sable, my friend! It's an ancient technique that's been used for centuries in the Middle East and North Africa. The name itself..."
        }
    ],
    "num_tokens": 338,
    "num_input_tokens": 149
}
```

The longer the conversation gets, the more time it takes the model to generate the response. The conversation is limited by the context size of a model. Larger models also usually take more time to respond.

#### Input format

You can see below the basic format of the input. Bear in mind that newlines often matter.

```text
<s>[INST] Hello! [/INST]
```

Conversation prompts contain the history of the exchanged prompts and responses.

```text
<s>[INST] First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```

If you want to add system prompt, it is done like this

```text
<s>[INST] <<SYS>>
System prompt
<</SYS>>

First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```



We recommend using our NodeJS client https://github.com/deepinfra/deepinfra-node.

You can install it with

```bash
npm install deepinfra
```

#### Simple prompt

To query this model you need to provide a properly formatted input string.

```javascript
import { TextGeneration } from "deepinfra";

const DEEPINFRA_API_KEY = '$DEEPINFRA_TOKEN';
const MODEL_URL = 'https://api.deepinfra.com/v1/inference/mistralai/Mixtral-8x22B-v0.1';

async function main() {
  const client = new TextGeneration(MODEL_URL, DEEPINFRA_API_KEY);
  const res = await client.generate({
    "input": "<s>[INST] Hello! [/INST]",
    "stop": [
      "</s>"
    ]
  });
  console.log(res.results[0].generated_text);
}

main();

// Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
```

#### Conversations

The OpenAI Chat Completions API is better suited for chat-like conversations, use it instead.

However, you can still do it if you really need to. You have to add each response and each of the user prompts to every request.
You need a properly formatted input string to make it understand the current context. See the example below for some of them.
You can tweak it even further by providing a system message.

```javascript
import { TextGeneration } from "deepinfra";

const DEEPINFRA_API_KEY = '$DEEPINFRA_TOKEN';
const MODEL_URL = 'https://api.deepinfra.com/v1/inference/mistralai/Mixtral-8x22B-v0.1';

async function main() {
  const client = new TextGeneration(MODEL_URL, DEEPINFRA_API_KEY);
  const res = await client.generate({
    "input": "<s>[INST] <<SYS>>\nRespond like a michelin starred chef.\n<</SYS>>\n\nCan you name at least two different techniques to cook lamb? [/INST] Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'm more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\" </s><s>[INST] Tell me more about the second method. [/INST]",
    "stop": [
      "</s>"
    ]
  });
  console.log(res.results[0].generated_text);
}

main();

// Sous le Sable! It's an ancient technique that never goes out of style, n'est-ce pas? Literally ...
```

The longer the conversation gets, the more time it takes the model to generate the response.
The number of messages that you can have in a conversation is limited by the context size of a model.
Larger models also usually take more time to respond.

#### Input format

You can see below the basic format of the input. Bear in mind that newlines often matter.

```text
<s>[INST] Hello! [/INST]
```

Conversation prompts contain the history of the exchanged prompts and responses.

```text
<s>[INST] First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```

If you want to add system prompt, it is done like this

```text
<s>[INST] <<SYS>>
System prompt
<</SYS>>

First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```



You can POST to our OpenAI Completions compatible endpoint.

However, this is an advanced and more complex API. We strongly recommend that you use OpenAI Chat Completions instead.

#### Simple prompt

To query this model you need to provide a properly formatted input string.

```bash
curl "https://api.deepinfra.com/v1/openai/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
     "model": "mistralai/Mixtral-8x22B-v0.1",
     "prompt": "<s>[INST] Hello! [/INST]",
     "stop": [
       "</s>"
     ]
   }'
```

To which you'd get something like

```json
{
    "id": "cmpl-1b8401a68c5141eb825f68944dcea2c1",
    "object": "text_completion",
    "created": 1700578595,
    "model": "mistralai/Mixtral-8x22B-v0.1",
    "choices": [
        {
            "index": 0,
            "text": "Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?",
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 4,
        "total_tokens": 9,
        "completion_tokens": 5,
        "estimated_cost": 0.00035493
    }
}
```

#### Conversations

The OpenAI Chat Completions API is better suited for chat-like conversations, use it instead.

However, you can still do it if you really need to. You have to add each response and each of the user prompts to every request.
You need a properly formatted input string to make it understand the current context. See the example below for some of them.
You can tweak it even further by providing a system message.

```bash
curl "https://api.deepinfra.com/v1/openai/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
     "model": "mistralai/Mixtral-8x22B-v0.1",
     "prompt": "<s>[INST] <<SYS>>\nRespond like a michelin starred chef.\n<</SYS>>\n\nCan you name at least two different techniques to cook lamb? [/INST] Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'"'"'m more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\" </s><s>[INST] Tell me more about the second method. [/INST]",
     "stop": [
       "</s>"
     ]
   }'
```

The conversation above might return something like the following

```json
{
    "id": "cmpl-b23a3fb60cde42ce8f24bb980b4dee87",
    "object": "text_completion",
    "created": 1715688169,
    "model": "mistralai/Mixtral-8x22B-v0.1",
    "choices": [
        {
            "index": 0,
            "text": "Sous le Sable, my friend! It's an ancient technique that's been used for centuries in the Middle East and North Africa. The name itself...",
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 149,
        "total_tokens": 487,
        "completion_tokens": 338,
        "estimated_cost": 0.00035493
    }
}
```

The longer the conversation gets, the more time it takes the model to generate the response. The conversation is limited by the context size of a model. Larger models also usually take more time to respond.

<br>

### Streaming

You can also perform a streaming request by passing `"stream": true`:

```bash
curl "https://api.deepinfra.com/v1/openai/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
     "model": "mistralai/Mixtral-8x22B-v0.1",
     "prompt": "<s>[INST] Hello! [/INST]",
     "stop": [
       "</s>"
     ],
     "stream": true
   }'
```

to which you'd get a sequence of [SSE](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events) events, finishing with `[DONE]`.

```
data: {"id": "cmpl-158cbf94cef043c2955172e8062ded3d", "object": "text_completion", "created": 1694623354, "model": "mistralai/Mixtral-8x22B-v0.1", "choices": [{"index": 0, "text": "Hi", "finish_reason": null}]}

data: {"id": "cmpl-158cbf94cef043c2955172e8062ded3d", "object": "text_completion", "created": 1694623354, "model": "mistralai/Mixtral-8x22B-v0.1", "choices": [{"index": 0, "text": "!", "finish_reason": null}]}

data: {"id": "cmpl-158cbf94cef043c2955172e8062ded3d", "object": "text_completion", "created": 1694623354, "model": "mistralai/Mixtral-8x22B-v0.1", "choices": [{"index": 0, "text": "", "finish_reason": null}]}

data: {"id": "cmpl-158cbf94cef043c2955172e8062ded3d", "object": "text_completion", "created": 1694623354, "model": "mistralai/Mixtral-8x22B-v0.1", "choices": [{"index": 0, "text": "", "finish_reason": "stop"}]}

data: [DONE]
```

#### Input format

You can see below the basic format of the input. Bear in mind that newlines often matter.

```text
<s>[INST] Hello! [/INST]
```

Conversation prompts contain the history of the exchanged prompts and responses.

```text
<s>[INST] First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```

If you want to add system prompt, it is done like this

```text
<s>[INST] <<SYS>>
System prompt
<</SYS>>

First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```



You can use the official openai python client to run inferences with us.

However, this is an advanced and more complex API. We strongly recommend that you use OpenAI Chat Completions instead.

#### Simple prompt

To query this model you need to provide a properly formatted input string.

```python
from openai import OpenAI

# Point OpenAI client to our endpoint, provide token
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

completion = openai.completions.create(
    model='mistralai/Mixtral-8x22B-v0.1',
    prompt='<s>[INST] Hello! [/INST]',
    stop=['</s>'],
)

print(completion.choices[0].text)
print(completion.usage.prompt_tokens, completion.usage.completion_tokens)

# Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
# 11 25
```

#### Conversations

The OpenAI Chat Completions API is better suited for chat-like conversations, use it instead.

However, you can still do it if you really need to. You have to add each response and each of the user prompts to every request.
You need a properly formatted input string to make it understand the current context. See the example below for some of them.
You can tweak it even further by providing a system message.

```python
from openai import OpenAI

# Point OpenAI client to our endpoint, provide token
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

completion = openai.completions.create(
    model='mistralai/Mixtral-8x22B-v0.1',
    prompt='<s>[INST] <<SYS>>\nRespond like a michelin starred chef.\n<</SYS>>\n\nCan you name at least two different techniques to cook lamb? [/INST] Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I\'m more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic "Sous Vide" method. Next, we have the ancient art of "Sous le Sable". And finally, we have the more modern technique of "Hot Smoking." </s><s>[INST] Tell me more about the second method. [/INST]',
    stop=['</s>'],
)

print(completion.choices[0].text)
print(completion.usage.prompt_tokens, completion.usage.completion_tokens)

# Sous le Sable! It's an ancient technique that never goes out of style, n'est-ce pas? Literally ...
# 149 324
```

The longer the conversation gets, the more time it takes the model to generate the response.
The number of messages that you can have in a conversation is limited by the context size of a model.
Larger models also usually take more time to respond.


<br>

### Streaming

Streaming completions is supported by adding the `stream=True` option.

```python
from openai import OpenAI

# Point OpenAI client to our endpoint, provide token
openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

completion = openai.completions.create(
    model='mistralai/Mixtral-8x22B-v0.1',
    prompt='<s>[INST] Hello! [/INST]',
    stop=['</s>'],
    stream=True,
)

for event in completion:
    if event.choices[0].finish_reason:
        print(event.choices[0].finish_reason, event.usage.prompt_tokens, event.usage.completion_tokens)
    else:
        print(event.choices[0].text)

# Hello
# !
# It
# 's
# nice
# ...
# 11 25
```

#### Input format

You can see below the basic format of the input. Bear in mind that newlines often matter.

```text
<s>[INST] Hello! [/INST]
```

Conversation prompts contain the history of the exchanged prompts and responses.

```text
<s>[INST] First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```

If you want to add system prompt, it is done like this

```text
<s>[INST] <<SYS>>
System prompt
<</SYS>>

First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```



You can use JavaScript in the browser or node.js to make requests with us.

However, this is an advanced and more complex API. We strongly recommend that you use OpenAI Chat Completions instead.

```bash
npm install openai
```

#### Simple prompt

To query this model you need to provide a properly formatted input string.

```javascript
import OpenAI from "openai";

const openai = new OpenAI({
    baseURL: 'https://api.deepinfra.com/v1/openai',
    apiKey: "$DEEPINFRA_TOKEN",
});

async function main() {
  const completion = await openai.completions.create({
    "model": "mistralai/Mixtral-8x22B-v0.1",
    "prompt": "<s>[INST] Hello! [/INST]",
    "stop": [
      "</s>"
    ]
  });

  console.log(completion.choices[0].text);
  console.log(completion.usage.prompt_tokens, completion.usage.completion_tokens);
}

main();

// Hello! It's nice to meet you. Is there something I can help you with, or would you like to chat?
// 11 25
```

#### Conversations

The OpenAI Chat Completions API is better suited for chat-like conversations, use it instead.

However, you can still do it if you really need to. You have to add each response and each of the user prompts to every request.
You need a properly formatted input string to make it understand the current context. See the example below for some of them.
You can tweak it even further by providing a system message.

```javascript
import OpenAI from "openai";

const openai = new OpenAI({
    baseURL: 'https://api.deepinfra.com/v1/openai',
    apiKey: "$DEEPINFRA_TOKEN",
});

async function main() {
  const completion = await openai.completions.create({
    "model": "mistralai/Mixtral-8x22B-v0.1",
    "prompt": "<s>[INST] <<SYS>>\nRespond like a michelin starred chef.\n<</SYS>>\n\nCan you name at least two different techniques to cook lamb? [/INST] Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'm more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\" </s><s>[INST] Tell me more about the second method. [/INST]",
    "stop": [
      "</s>"
    ]
  });

  console.log(completion.choices[0].text);
  console.log(completion.usage.prompt_tokens, completion.usage.completion_tokens);
}

main();

// Sous le Sable! It's an ancient technique that never goes out of style, n'est-ce pas? Literally ...
// 149 324
```

The longer the conversation gets, the more time it takes the model to generate the response.
The number of messages that you can have in a conversation is limited by the context size of a model.
Larger models also usually take more time to respond.


<br>

### Streaming

Streaming completions is supported by adding the `stream: true` option.

```javascript
import OpenAI from "openai";

const openai = new OpenAI({
    baseURL: 'https://api.deepinfra.com/v1/openai',
    apiKey: "$DEEPINFRA_TOKEN",
});

async function main() {
  const completion = await openai.completions.create({
    "model": "mistralai/Mixtral-8x22B-v0.1",
    "prompt": "<s>[INST] Hello! [/INST]",
    "stop": [
      "</s>"
    ],
    "stream": true
  });

  for await (const chunk of completion) {
    if (chunk.choices[0].finish_reason) {
        console.log(chunk.choices[0].finish_reason, chunk.usage.prompt_tokens, chunk.usage.completion_tokens);
    } else {
        console.log(chunk.choices[0].text);
    }
  }
}

main();

// Hello
// !
// It
// 's
// nice
// ...
// 11 25
```

#### Input format

You can see below the basic format of the input. Bear in mind that newlines often matter.

```text
<s>[INST] Hello! [/INST]
```

Conversation prompts contain the history of the exchanged prompts and responses.

```text
<s>[INST] First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```

If you want to add system prompt, it is done like this

```text
<s>[INST] <<SYS>>
System prompt
<</SYS>>

First question [/INST] First answer </s><s>[INST] Second question [/INST] Second answer </s><s>[INST] Final question [/INST]
```



input

maximum length of the newly generated generated text.If explicitly set to None it will be the model's max context length minus input length.

max_new_tokens

temperature to use for sampling. 0 means the output is deterministic. Values greater than 1 encourage more diversity

temperature

Sample from the set of tokens with highest probability such that sum of probabilies is higher than p. Lower values focus on the most probable tokens.Higher values sample more low-probability tokens

top_p

Sample from the best k (number of) tokens. 0 means off

top_k

repetition penalty. Value of 1 means no penalty, values greater than 1 discourage repetition, smaller than 1 encourage repetition.

repetition_penalty

Up to 16 strings that will terminate generation immediately

stop

Number of output sequences to return. Incompatible with streaming

num_responses

Optional nested object with "type" set to "json_object"

response_format

Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

presence_penalty

Positive values penalize new tokens based on how many times they appear in the text so far, increasing the model's likelihood to talk about new topics.

frequency_penalty

The webhook to call when inference is done, by default you will get the output in the response of your inference request

webhook

Whether to stream tokens, by default it will be false, currently only supported for Llama 2 text generation models, token by token updates will be sent over SSE

stream

Type

ResponseFormat

Frequency Penalty

Input

Max New Tokens

Num Responses

Presence Penalty

Repetition Penalty

Response Format

Stop

Stream

Temperature

Top K

Top P

Webhook

TextGenerationIn

I have this dream about the day I got a job at a tech company. I just woke up on a plane. I sat down on the floor and started getting work done. After getting up around 6 p.m., I looked around and

Generated Text

GeneratedText

estimated cost billed for the request in USD

Cost

Runtime Ms

Status

Tokens Generated

Tokens Input

Object containing the status of the inference request

Inference Status

Num Input Tokens

number of generated tokens, excluding prompt

Num Tokens

Request Id

Results

TextGenerationOut

model

conversation messages: (user,assistant,tool)*,user including one system message anywhere

messages

whether to stream the output via SSE or return the full response

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic

The maximum number of tokens to generate in the chat completion.

The total length of input tokens and generated tokens is limited by the model's context length.If explicitly set to None it will be the model's max context length minus input length.

max_tokens

up to 16 sequences where the API will stop generating further tokens

number of sequences to return. n != 1 incompatible with streaming

A list of tools the model may call. Currently, only functions are supported as a tool.

tools

Controls which (if any) function is called by the model. none means the model will not call a function and instead generates a message. auto means the model can pick between generating a message or calling a function. specifying a particular function choice is not supported currently.none is the default when no functions are present. auto is the default if functions are present.

tool_choice

The format of the response. Currently, only json is supported.

Alternative penalty for repetition, but multiplicative instead of additive (> 1 penalize, < 1 encourage)

ChatCompletionAssistantMessage

ChatCompletionContentPartImage

ChatCompletionContentPartText

ChatCompletionMessageToolCall

ChatCompletionSystemMessage

ChatCompletionToolMessage

ChatCompletionUserMessage

ChatTools

Function

FunctionDefinition

ImageURL

Max Tokens

Messages

Model

Tool Choice

Tools

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.

OpenAIChatCompletionsIn

FinishReason

OpenAIChatCompletionChoice

UsageInfo

a list of chat completion choices, can be more than one

Choices

Created

the object type, which is always chat.completion

Object

Usage

OpenAIChatCompletionOut

ChatMessageRole

OpenAIChatCompletionStreamChoice

OpenAIDeltaMessage

OpenAIDeltaToolCall

OpenAIDeltaToolCallFunction

the object type, which is always chat.completion.chunk

OpenAIChatCompletionStreamOut

input prompt - a single string is currently supported

prompt

The maximum number of tokens to generate in the completion.

The total length of input tokens and generated tokens is limited by the model's context length.If explicitly set to None it will be the model's max context length minus input length.