DeepInfra/pygmalion-13b-4bit-128g


A model for fictional writing and entertainment purposes


$0.22 / Mtoken


To query this model you need to provide a properly formatted input string.

curl "" \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer $(deepctl auth token)" \
   -d '{
     "input": "[INST] Just say hi! [/INST] "
   }'

That will respond with:

    {
        "request_id": "RWZDRhS5kdoM1XWwXLEshynO",
        "inference_status": {
            "status": "succeeded",
            "runtime_ms": 243,
            "cost": 0.0,
            "tokens_generated": 3
        },
        "results": [
            {
                "generated_text": "Hi!"
            }
        ],
        "num_tokens": 3
    }
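The same request can be prepared from Python. The following is a minimal sketch using only the standard library. It assumes the caller supplies the endpoint URL and API token (the URL is left empty in the curl example, so `url` is just a parameter here), and it assumes each entry in "results" is an object with a "generated_text" field. The helper names `build_request` and `extract_text` are hypothetical and not part of any DeepInfra client.

```python
import json
import urllib.request


def build_request(url: str, token: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the prompt as JSON."""
    body = json.dumps({"input": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def extract_text(response_body: str) -> str:
    """Return the generated text of the first result in a response body."""
    payload = json.loads(response_body)
    if payload["inference_status"]["status"] != "succeeded":
        raise RuntimeError(f"inference failed: {payload['inference_status']}")
    # Assumption: each result is an object with a "generated_text" field.
    return payload["results"][0]["generated_text"]
```

To send the request, pass the object returned by `build_request` to `urllib.request.urlopen`, then feed the decoded response body to `extract_text`.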

To do a streaming request, just pass "stream": true:

curl "" \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer $(deepctl auth token)" \
   -d '{
     "input": "[INST] Just say hi! [/INST] ",
     "stream": true
   }'

which outputs:

data: {"token": {"id": 6324, "text": " Hi", "logprob": 0.0, "special": false}, "generated_text": null, "details": null}

data: {"token": {"id": 29991, "text": "!", "logprob": 0.0, "special": false}, "generated_text": null, "details": null}

data: {"token": {"id": 2, "text": "</s>", "logprob": -0.22229004, "special": true}, "generated_text": "Hi!", "details": {"finish_reason": "eos_token", "generated_tokens": 3, "input_tokens": 13, "seed": 16848278268029293276}}
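A client has to split this stream into events and decode each one. The sketch below assumes the stream arrives as text lines in the shape shown above, with one JSON object per `data:` line and the full text in "generated_text" on the final event. The function names `iter_stream_events` and `collect_text` are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON object from each `data:` line of the stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any other SSE fields
        yield json.loads(line[len("data:"):])


def collect_text(lines: Iterable[str]) -> str:
    """Return the final generated text, or the tokens seen so far if cut short."""
    pieces = []
    for event in iter_stream_events(lines):
        if event["generated_text"] is not None:
            return event["generated_text"]  # final event carries the full text
        if not event["token"]["special"]:
            pieces.append(event["token"]["text"])
    # Stream ended without a final event: fall back to the partial text.
    return "".join(pieces)
```

Each intermediate event can also be handled as it arrives, for example by printing `event["token"]["text"]` inside the loop to display output token by token.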

The basic format of the input is:

[INST] first question [/INST] first answer</s><s>
[INST] second question [/INST] second answer</s><s>
[INST] final question [/INST]
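Assembling this string by hand is error-prone, so a small helper can do it. The sketch below follows the layout shown above exactly. The name `build_prompt` and the representation of history as (question, answer) pairs are assumptions made for illustration.

```python
from typing import List, Tuple


def build_prompt(history: List[Tuple[str, str]], final_question: str) -> str:
    """Format completed (question, answer) turns plus a final open question."""
    # Each completed turn is closed with </s> and the next one opened with <s>.
    lines = [f"[INST] {q} [/INST] {a}</s><s>" for q, a in history]
    # The final question is left open so the model generates the answer.
    lines.append(f"[INST] {final_question} [/INST]")
    return "\n".join(lines)
```

With an empty history, this returns a single `[INST] ... [/INST]` line. Note that the curl examples end the prompt with a trailing space after `[/INST]`; append one if you want to match them exactly.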

To add a system prompt, modify the first question (newlines matter):

[INST] <<SYS>>
your system prompt goes here
<</SYS>>

first question [/INST] ...
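Because the newlines matter, it can help to generate the first turn programmatically. This sketch assumes the standard Llama 2 convention of closing the system block with `<</SYS>>` followed by a blank line. The helper name `with_system_prompt` is hypothetical.

```python
def with_system_prompt(system_prompt: str, first_question: str) -> str:
    """Wrap the first question so that it carries a system prompt."""
    return (
        "[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n"
        "\n"  # blank line between the system block and the question
        f"{first_question} [/INST]"
    )
```

Only the first question of a conversation should be wrapped this way; later turns use the plain `[INST] ... [/INST]` form.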

For airoboros the prompt can be:

A chat.
USER: question

To include conversation history, separate successive prompts with an extra newline. See the airoboros prompt format for more details.

Input fields


input

Text to generate from.


max_new_tokens

Maximum length of the newly generated text. If not set or None, defaults to the model's max context length minus the input length.

Default value: 512

Range: 1 ≤ max_new_tokens ≤ 100000


temperature

Temperature to use for sampling. 0 means the output is deterministic. Values greater than 1 encourage more diversity.

Default value: 0.7

Range: 0 ≤ temperature ≤ 100


top_p

Sample from the set of highest-probability tokens whose cumulative probability exceeds p. Lower values focus on the most probable tokens. Higher values sample more low-probability tokens.

Default value: 0.9

Range: 0 < top_p ≤ 1


top_k

Sample from the k most probable tokens. 0 means off.

Default value: 0

Range: 0 ≤ top_k < 100000


repetition_penalty

Repetition penalty. A value of 1 means no penalty; values greater than 1 discourage repetition, and values smaller than 1 encourage it.

Default value: 1

Range: 0.01 ≤ repetition_penalty ≤ 5


stop

Up to 16 strings that will terminate generation immediately.


num_responses

Number of output sequences to return. Incompatible with streaming.

Default value: 1

Range: 1 ≤ num_responses ≤ 2


response_format

Optional nested object with "type" set to "json_object".


presence_penalty

Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

Default value: 0

Range: -2 ≤ presence_penalty ≤ 2


frequency_penalty

Positive values penalize new tokens based on how many times they appear in the text so far, increasing the model's likelihood to talk about new topics.

Default value: 0

Range: -2 ≤ frequency_penalty ≤ 2


webhook

The webhook to call when inference is done. By default, the output is returned in the response to your inference request.


stream

Whether to stream tokens. Defaults to false. Currently only supported for Llama 2 text generation models. Token-by-token updates are sent over server-sent events (SSE).

Default value: false
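The documented ranges can be checked on the client before a request is sent. The sketch below copies the bounds from the Range lines above, including which ends are exclusive, and applies the rule that multiple responses are incompatible with streaming. The names `RANGES` and `validate_params` are hypothetical, and the server may enforce these limits differently.

```python
from typing import Any, Dict, List

# name: (low, high, low_inclusive, high_inclusive), taken from the Range lines
RANGES = {
    "max_new_tokens": (1, 100000, True, True),
    "temperature": (0, 100, True, True),
    "top_p": (0, 1, False, True),       # 0 < top_p <= 1
    "top_k": (0, 100000, True, False),  # 0 <= top_k < 100000
    "repetition_penalty": (0.01, 5, True, True),
    "num_responses": (1, 2, True, True),
    "presence_penalty": (-2, 2, True, True),
    "frequency_penalty": (-2, 2, True, True),
}


def validate_params(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    errors = []
    for name, value in params.items():
        if name not in RANGES:
            continue  # fields without a documented range are not checked
        low, high, low_inc, high_inc = RANGES[name]
        above = value >= low if low_inc else value > low
        below = value <= high if high_inc else value < high
        if not (above and below):
            errors.append(f"{name}={value} is outside the documented range")
    if params.get("stream") and params.get("num_responses", 1) > 1:
        errors.append("num_responses > 1 is incompatible with streaming")
    return errors
```

For example, `validate_params({"temperature": 0.7, "top_p": 0.9})` returns an empty list, while a `top_p` of 0 is reported because the lower bound is exclusive.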
