WizardLM-2 7B is the smaller variant of Microsoft AI's latest Wizard model. It is the fastest variant and achieves performance comparable to leading open-source models 10x its size.

WizardLM-2-7B

WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to leading proprietary models.

WizardLM-2-8x22B

Zephyr 141B-A35B is an instruction-tuned (assistant) version of Mixtral-8x22B. It was fine-tuned on a mix of publicly available, synthetic datasets. It achieves strong performance on chat benchmarks.

zephyr-orpo-141b-A35b-v0.1

Mixtral-8x22B is the latest and largest mixture-of-experts (MoE) large language model (LLM) from Mistral AI. It is a state-of-the-art model built from a mixture of 8 experts, each a 22B model. During inference, 2 experts are selected for each token. This architecture allows large models to be fast and cheap at inference. This model is not instruction tuned.

Mixtral-8x22B-v0.1

Gemma is an open-source model designed by Google. This is Gemma 1.1 7B (IT), an update over the original instruction-tuned Gemma release. Gemma 1.1 was trained using a novel RLHF method, leading to substantial gains on quality, coding capabilities, factuality, instruction following and multi-turn conversation quality.

gemma-1.1-7b-it

DBRX is an open-source LLM created by Databricks. It uses a mixture-of-experts (MoE) architecture with 132B total parameters, of which 36B are active on any input. It outperforms existing open-source LLMs such as Llama 2 70B and Mixtral-8x7B on standard industry benchmarks for language understanding, programming, math, and logic.

dbrx-instruct

Mixtral is a mixture-of-experts (MoE) large language model (LLM) from Mistral AI. It is a state-of-the-art model built from a mixture of 8 experts, each a 7B model. During inference, 2 experts are selected for each token. This architecture allows large models to be fast and cheap at inference. Mixtral-8x7B outperforms Llama 2 70B on most benchmarks.

Mixtral-8x7B-Instruct-v0.1

Mistral-7B-Instruct-v0.2 is an instruction fine-tuned version of the Mistral-7B-v0.2 generative text model, trained on a variety of publicly available conversation datasets.

Mistral-7B-Instruct-v0.2

Llama 2 is a collection of LLMs trained by Meta. This is the 70B chat-optimized version. This endpoint has per-token pricing.

Llama-2-70b-chat-hf

The Dolphin 2.6 Mixtral 8x7b model is a fine-tuned version of the Mixtral-8x7b model, trained for 3 days on 4 A100 GPUs on a variety of data, including coding data. It is uncensored and requires trust_remote_code. The model is very obedient and good at coding, but it is not DPO tuned. The dataset was filtered to remove alignment and bias, which makes the model more compliant with user requests. It can be used for various purposes, such as generating code or engaging in general chat.

dolphin-2.6-mixtral-8x7b

A Mythomax/MLewd_13B-style merge of selected 70B models: a multi-model merge of several Llama 2 70B fine-tunes for roleplaying and creative work. The goal was to create a model that combines creativity with intelligence for an enhanced experience.

lzlv_70b_fp16_hf

OpenChat is a library of open-source language models that have been fine-tuned with C-RLFT, a strategy inspired by offline reinforcement learning. These models can learn from mixed-quality data without preference labels and have achieved exceptional performance comparable to ChatGPT. The developers of OpenChat are dedicated to creating a high-performance, commercially viable, open-source large language model and are continuously making progress towards this goal.

openchat_3.5

LLaVA is a multimodal model that combines vision and language understanding.

llava-1.5-7b-hf

StarCoder2-15B is a 15B-parameter model trained on 600+ programming languages. It specializes in code completion.

starcoder2-15b

A model for fictional writing and entertainment purposes.

pygmalion-13b-4bit-128g

The latest version of the Airoboros model, a fine-tuned version of llama-2-70b trained on the Airoboros dataset. This model is currently running jondurbin/airoboros-l2-70b-2.2.1.

airoboros-70b

SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.

sdxl

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B fine-tuned model, optimized for dialogue use cases and converted to the Hugging Face Transformers format.

Llama-2-7b-chat-hf

The most widely used version of Stable Diffusion. Trained on 512x512 images, it can generate realistic images from a text description.

stable-diffusion-v1-5

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

whisper-large

A text-to-image model based on Stable Diffusion.

openjourney

BGE is a general-purpose embedding model. It is pre-trained with RetroMAE and then trained on large-scale pair data using contrastive learning. Note that the goal of pre-training is to reconstruct text, so the pre-trained model cannot be used directly for similarity calculation; it needs to be fine-tuned first.

bge-large-en-v1.5

To query this model you need to provide a properly formatted input string.

```bash
curl "https://api.deepinfra.com/v1/inference/DeepInfra/pygmalion-13b-4bit-128g" \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer $(deepctl auth token)" \
   -d '{
     "input": "[INST] Just say hi! [/INST] "
   }'
```

That will respond with:

```json
{
    "request_id": "RWZDRhS5kdoM1XWwXLEshynO",
    "inference_status": {
        "status": "succeeded",
        "runtime_ms": 243,
        "cost": 0.0,
        "tokens_generated": 3
    },
    "results": [
        {
            "generated_text": "Hi!"
        }
    ],
    "num_tokens": 3
}
```
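The same request can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the helper names `build_inference_request` and `first_generated_text` are made up for this example, and the token placeholder stands in for the output of `deepctl auth token`.

```python
import json
import urllib.request

API_BASE = "https://api.deepinfra.com/v1/inference"


def build_inference_request(model, prompt, token):
    """Build (but do not send) a POST request for the inference endpoint."""
    body = json.dumps({"input": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/{model}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def first_generated_text(response_body):
    """Extract the first generated text from a JSON response body."""
    return json.loads(response_body)["results"][0]["generated_text"]


# Usage (performs a real network call, so it is left commented out):
# req = build_inference_request(
#     "DeepInfra/pygmalion-13b-4bit-128g",
#     "[INST] Just say hi! [/INST] ",
#     "<YOUR DEEPINFRA TOKEN>",
# )
# with urllib.request.urlopen(req) as resp:
#     print(first_generated_text(resp.read()))
```

Separating request construction from sending keeps the payload easy to inspect before any tokens are spent.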

To do a streaming request, just pass `"stream": true`:

```bash
curl "https://api.deepinfra.com/v1/inference/DeepInfra/pygmalion-13b-4bit-128g" \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer $(deepctl auth token)" \
   -d '{
     "input": "[INST] Just say hi! [/INST] ",
     "stream": true
   }'
```

which outputs:

```
data: {"token": {"id": 6324, "text": " Hi", "logprob": 0.0, "special": false}, "generated_text": null, "details": null}

data: {"token": {"id": 29991, "text": "!", "logprob": 0.0, "special": false}, "generated_text": null, "details": null}

data: {"token": {"id": 2, "text": "</s>", "logprob": -0.22229004, "special": true}, "generated_text": "Hi!", "details": {"finish_reason": "eos_token", "generated_tokens": 3, "input_tokens": 13, "seed": 16848278268029293276}}
```
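A stream like the one above could be consumed along these lines. This is a sketch under the assumption that every event arrives as a `data: {...}` line shaped like the sample, with the last event carrying a non-null `generated_text`; the function name `read_stream` is hypothetical.

```python
import json


def read_stream(lines):
    """Return (text, details) from an iterable of SSE lines."""
    pieces = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        event = json.loads(line[len("data:"):])
        if event.get("generated_text") is not None:
            # The final event carries the full text and the run details.
            return event["generated_text"], event["details"]
        token = event["token"]
        if not token["special"]:
            pieces.append(token["text"])
    # Stream ended without a final event: fall back to the tokens seen so far.
    return "".join(pieces), None
```

In a real client, `lines` would be the decoded lines of the HTTP response body, read incrementally so tokens can be shown as they arrive.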


The basic format of the input is:

```
[INST] first question [/INST] first answer</s><s>
[INST] second question [/INST] second answer</s><s>
[INST] final question [/INST]
```

To add a system prompt, modify the first question (newlines matter):

```
[INST] <<SYS>>
your system prompt goes here
<</SYS>>

first question [/INST] ...
```
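As a rough sketch, the two templates above can be assembled programmatically. The helper name `build_prompt` and its arguments are illustrative and not part of any DeepInfra library, and the closing system tag is assumed to be `<</SYS>>`, following the Llama 2 chat convention.

```python
def build_prompt(history, question, system_prompt=None):
    """Assemble an [INST]-style prompt.

    history: list of (question, answer) pairs from earlier turns.
    question: the new question the model should answer.
    system_prompt: optional text wrapped in <<SYS>> tags (assumed format).
    """
    questions = [q for q, _ in history] + [question]
    if system_prompt is not None:
        # The system prompt is folded into the very first question.
        questions[0] = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{questions[0]}"
    lines = [
        f"[INST] {q} [/INST] {a}</s><s>"
        for q, (_, a) in zip(questions, history)
    ]
    # The final question is left open for the model to complete.
    lines.append(f"[INST] {questions[-1]} [/INST]")
    return "\n".join(lines)
```

For example, `build_prompt([("first question", "first answer")], "final question")` produces a two-line prompt in the basic format, ending with the open `[INST] final question [/INST]` turn.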

For airoboros the prompt can be:

```
A chat.
USER: question
ASSISTANT:
```

Just stick an extra newline between prompts for history. Check [airoboros
prompt
format](https://huggingface.co/jondurbin/airoboros-l2-7b-2.2/blob/main/README.md#prompt-format)
for more info.
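One possible reading of this format in code is shown below. It assumes that "an extra newline between prompts" means a blank line between earlier turns, which the linked prompt-format page should confirm; `build_airoboros_prompt` is a hypothetical helper name.

```python
def build_airoboros_prompt(history, question, system_prompt="A chat."):
    """Assemble a USER/ASSISTANT prompt with optional chat history.

    history: list of (question, answer) pairs from earlier turns.
    """
    blocks = [f"USER: {q}\nASSISTANT: {a}" for q, a in history]
    # The last block is left open for the model to complete.
    blocks.append(f"USER: {question}\nASSISTANT:")
    # Assumption: earlier turns are separated by one blank line.
    return system_prompt + "\n" + "\n\n".join(blocks)
```

With an empty history this reproduces the three-line template shown above.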


You can use our command-line tool [deepctl](/docs/getting-started) to run
inferences:

```bash
deepctl infer \
    -m 'DeepInfra/pygmalion-13b-4bit-128g'  \
    -i 'input=I have this dream'
```

which will give you back something similar to:

```json
{
  "results": [
    {
      "generated_text": "I have this dream about the day I got a job at a tech company. I just woke up on a plane. I sat down on the floor and started getting work done. After getting up around 6 p.m., I looked around and"
    }
  ],
  "num_tokens": 42,
  "num_input_tokens": 100,
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0
  }
}

```


There is an excellent third-party library for nodejs:
https://github.com/ovuruska/deepinfra-api

```bash
npm install deepinfra-api
```

```javascript
import { TextGenerationBaseModel } from 'deepinfra-api/dist/lib/models/base/text-generation.js';

const DEEPINFRA_API_KEY = '<Your Key Here>';
const MODEL_URL = 'https://api.deepinfra.com/v1/inference/DeepInfra/pygmalion-13b-4bit-128g';

const client = new TextGenerationBaseModel(MODEL_URL, DEEPINFRA_API_KEY);
const res = await client.generate({input: "Hello"});
console.log(res.results[0].generated_text);
```


You can POST to our OpenAI compatible endpoint:

```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
      "model": "DeepInfra/pygmalion-13b-4bit-128g",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'
```

To which you'd get something like:

```json
{
    "id": "chatcmpl-guMTxWgpFf",
    "object": "chat.completion",
    "created": 1694623155,
    "model": "DeepInfra/pygmalion-13b-4bit-128g",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 15,
        "completion_tokens": 16,
        "total_tokens": 31
    }
}
```

You can also perform a streaming request by passing `"stream": true`:

```bash
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
      "model": "DeepInfra/pygmalion-13b-4bit-128g",
      "stream": true,
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'
```

to which you'd get a sequence of [SSE](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events) events, finishing with `[DONE]`.

```
data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "DeepInfra/pygmalion-13b-4bit-128g", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " "}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "DeepInfra/pygmalion-13b-4bit-128g", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " Hi"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "DeepInfra/pygmalion-13b-4bit-128g", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "!"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "DeepInfra/pygmalion-13b-4bit-128g", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "DeepInfra/pygmalion-13b-4bit-128g", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "</s>"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "DeepInfra/pygmalion-13b-4bit-128g", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

data: [DONE]
```
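If you read the raw event stream yourself rather than through a client library, the chunks can be joined roughly as follows. This sketch assumes each event is a single `data:` line shaped like the sample above; `join_chat_chunks` is an illustrative name.

```python
import json


def join_chat_chunks(lines):
    """Concatenate delta contents until the [DONE] sentinel.

    Returns (text, finish_reason).
    """
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker, not JSON
        choice = json.loads(payload)["choices"][0]
        # The last chunk has an empty delta, so content may be missing.
        parts.append(choice["delta"].get("content") or "")
        finish_reason = choice["finish_reason"] or finish_reason
    return "".join(parts), finish_reason
```

Note that `[DONE]` has to be checked before parsing, since it is not valid JSON.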

Currently supported parameters:
- `temperature` - more or less random generation
- `top_p` - controls token sampling
- `max_tokens` - maximum number of generated tokens
- `stop` - up to 4 strings to terminate generation earlier
- `n` - number of sequences to generate (up to 2)

Known caveats:
- if the generation is terminated due to a stop sequence, the stop sequence is
  present in the output (but in OpenAI it is not).


You can use the official openai python client to run inferences with us:

```python
# Assume openai>=1.0.0
from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="<YOUR DEEPINFRA TOKEN: deepctl auth token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="DeepInfra/pygmalion-13b-4bit-128g",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
```

You can also use the streaming option:

```python
chat_completion = openai.chat.completions.create(
    model="DeepInfra/pygmalion-13b-4bit-128g",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)

for event in chat_completion:
    # delta.content can be None (e.g. on the final chunk), so fall back to ""
    print(event.choices[0].delta.content or "", end="")
```



You can use JavaScript in the browser or node.js to make requests with us:


```javascript
// for node.js before v21, you can use node-fetch package
// import fetch from 'node-fetch'

const API_KEY = "<YOUR DEEPINFRA TOKEN>";

const response = await fetch('https://api.deepinfra.com/v1/openai/chat/completions', {
    method: 'POST',
    body: JSON.stringify({
        model: "DeepInfra/pygmalion-13b-4bit-128g",
        messages: [{role: "user", content: "Hello"}],
        max_tokens: 20,
    }),
    headers: {
        "Content-Type": "application/json",
        authorization: `Bearer ${API_KEY}`,
    }
});
const data = await response.json();

console.log(data.choices[0].message.content);
console.log(data.usage.prompt_tokens, data.usage.completion_tokens);
```


Input fields:
- `input` - the prompt text to generate from
- `max_new_tokens` - maximum length of the newly generated text. If not set or None, defaults to the model's max context length minus the input length
- `temperature` - temperature to use for sampling. 0 means the output is deterministic; values greater than 1 encourage more diversity
- `top_p` - sample from the smallest set of highest-probability tokens whose cumulative probability exceeds p. Lower values focus on the most probable tokens; higher values sample more low-probability tokens
- `top_k` - sample from the best k tokens. 0 means off
- `repetition_penalty` - repetition penalty. A value of 1 means no penalty, values greater than 1 discourage repetition, and values smaller than 1 encourage it
- `stop` - up to 16 strings that will terminate generation immediately
- `num_responses` - number of output sequences to return. Incompatible with streaming
- `response_format` - optional nested object with `"type"` set to `"json_object"`
- `presence_penalty` - positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics
- `frequency_penalty` - positive values penalize new tokens based on how many times they appear in the text so far, decreasing the model's likelihood to repeat the same text
- `webhook` - the webhook to call when inference is done. By default, you get the output in the response to your inference request
- `stream` - whether to stream tokens (default false). Currently only supported for Llama 2 text generation models; token-by-token updates are sent over SSE
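The input fields can be combined into a single JSON body for the inference endpoint. The sketch below is illustrative: `make_payload` is a hypothetical helper, and its two checks encode the stated limits (at most 16 stop strings, `num_responses` incompatible with streaming) under a conservative reading that rejects any use of `num_responses` together with `stream`.

```python
MAX_STOP_STRINGS = 16  # limit stated for the `stop` field


def make_payload(prompt, *, stream=False, stop=None, num_responses=None, **sampling):
    """Build the request body dict for a text generation call.

    Extra keyword arguments (temperature, top_p, top_k, max_new_tokens, ...)
    are passed through unchanged.
    """
    payload = {"input": prompt, **sampling}
    if stop is not None:
        if len(stop) > MAX_STOP_STRINGS:
            raise ValueError(f"at most {MAX_STOP_STRINGS} stop strings are allowed")
        payload["stop"] = list(stop)
    if num_responses is not None:
        if stream:
            # Assumption: treat any num_responses with streaming as an error.
            raise ValueError("num_responses is incompatible with streaming")
        payload["num_responses"] = num_responses
    if stream:
        payload["stream"] = True
    return payload
```

For instance, `make_payload("[INST] Hi [/INST]", temperature=0.7, max_new_tokens=64, stop=["</s>"])` yields a dict that can be serialized with `json.dumps` and sent as the request body.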
