
meta-llama/Llama-2-70b-chat-hf

LLaMa 2 is a collection of LLMs trained by Meta. This is the 70B chat-optimized version. This endpoint has per-token pricing.

Public
$0.64 / $0.80 per 1M input/output tokens
fp16
4k context
JSON
Paper · License

OpenAI-compatible HTTP API

You can POST to our OpenAI-compatible endpoint:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
      "model": "meta-llama/Llama-2-70b-chat-hf",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'

To which you'd get something like:

{
    "id": "chatcmpl-guMTxWgpFf",
    "object": "chat.completion",
    "created": 1694623155,
    "model": "meta-llama/Llama-2-70b-chat-hf",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 15,
        "completion_tokens": 16,
        "total_tokens": 31
    }
}
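
If you have jq installed, one convenient way to pull just the assistant's reply out of that response from the shell is to pipe the call through jq (an illustrative snippet, not part of the API itself):

curl -s "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
      "model": "meta-llama/Llama-2-70b-chat-hf",
      "messages": [{"role": "user", "content": "Hello!"}]
    }' \
  | jq -r '.choices[0].message.content'

The usage block is what gets billed: at $0.64/$0.80 per 1M input/output tokens, the example above (15 prompt + 16 completion tokens) works out to roughly 15 × $0.64/1M + 16 × $0.80/1M ≈ $0.000022.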

You can also perform a streaming request by passing "stream": true:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
      "model": "meta-llama/Llama-2-70b-chat-hf",
      "stream": true,
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'

To which you'd get a sequence of SSE events, finishing with [DONE]:

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-2-70b-chat-hf", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " "}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-2-70b-chat-hf", "choices": [{"index": 0, "delta": {"role": "assistant", "content": " Hi"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-2-70b-chat-hf", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "!"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-2-70b-chat-hf", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-2-70b-chat-hf", "choices": [{"index": 0, "delta": {"role": "assistant", "content": "</s>"}, "finish_reason": null}]}

data: {"id": "Rc5hsIPHOSfMP3rNSFUw9tfR", "object": "chat.completion.chunk", "created": 1694623354, "model": "meta-llama/Llama-2-70b-chat-hf", "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

data: [DONE]
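
To actually render the streamed reply as it arrives, strip the data: prefixes and concatenate the content fragments from each delta yourself (any OpenAI-compatible client library will do this for you). A minimal shell sketch, assuming GNU sed/grep and jq are available:

curl -s -N "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
      "model": "meta-llama/Llama-2-70b-chat-hf",
      "stream": true,
      "messages": [{"role": "user", "content": "Hello!"}]
    }' \
  | sed -u 's/^data: //' \
  | grep --line-buffered -v '^\[DONE\]' \
  | jq --unbuffered -rj '.choices[0].delta.content // empty'

The // empty guard skips chunks whose delta carries no content, such as the final chunk with finish_reason set to "stop".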

Currently supported parameters:

  • temperature - controls the randomness of the generation; higher values produce more varied output
  • top_p - nucleus sampling; only tokens within the top cumulative probability p are considered
  • max_tokens - maximum number of generated tokens
  • stop - up to 4 strings to terminate generation earlier
  • n - number of sequences to generate (up to 2)
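
For example, a request combining several of these parameters (the values below are arbitrary and only for illustration):

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
      "model": "meta-llama/Llama-2-70b-chat-hf",
      "temperature": 0.7,
      "top_p": 0.9,
      "max_tokens": 256,
      "stop": ["\n\n"],
      "n": 2,
      "messages": [
        {
          "role": "user",
          "content": "Write a haiku about the ocean."
        }
      ]
    }'

With "n": 2 the choices array in the response contains two alternative completions (index 0 and 1).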

Known caveats:

  • if the generation is terminated due to a stop sequence, the stop sequence is included in the output (unlike the OpenAI API, which strips it).

Input fields

input (string)

text to generate from


max_new_tokens (integer)

Maximum length of the newly generated text. If not set or None, defaults to the model's max context length minus the input length.

Default value: 512

Range: 1 ≤ max_new_tokens ≤ 100000


temperature (number)

temperature to use for sampling. 0 means the output is deterministic. Values greater than 1 encourage more diversity

Default value: 0.7

Range: 0 ≤ temperature ≤ 100


top_p (number)

Sample from the smallest set of tokens whose cumulative probability exceeds p. Lower values focus on the most probable tokens; higher values sample more low-probability tokens.

Default value: 0.9

Range: 0 < top_p ≤ 1


top_k (integer)

Sample from the best k (number of) tokens. 0 means off

Default value: 0

Range: 0 ≤ top_k < 100000


repetition_penalty (number)

repetition penalty. Value of 1 means no penalty, values greater than 1 discourage repetition, smaller than 1 encourage repetition.

Default value: 1

Range: 0.01 ≤ repetition_penalty ≤ 5


stop (array)

Up to 16 strings that will terminate generation immediately


num_responses (integer)

Number of output sequences to return. Incompatible with streaming

Default value: 1

Range: 1 ≤ num_responses ≤ 2


response_format (object)

Optional object controlling the output format; set "type" to "json_object" to request JSON output.


presence_penalty (number)

Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.

Default value: 0

Range: -2 ≤ presence_penalty ≤ 2


frequency_penalty (number)

Positive values penalize new tokens based on how many times they appear in the text so far, increasing the model's likelihood to talk about new topics.

Default value: 0

Range: -2 ≤ frequency_penalty ≤ 2


webhook (file)

The webhook to call when inference is done. By default you will get the output in the response of your inference request.


stream (boolean)

Whether to stream tokens. Defaults to false. Currently only supported for Llama 2 text generation models; token-by-token updates are sent over SSE.

Default value: false
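
These fields belong to the model's native (non-OpenAI) inference API. As a sketch, assuming the standard DeepInfra inference URL for this model (the Input/Output Schema sections below are the authoritative reference), a request using several of these fields could look like:

# Assumed endpoint path; the [INST] ... [/INST] wrapping follows the Llama 2 chat prompt format.
curl "https://api.deepinfra.com/v1/inference/meta-llama/Llama-2-70b-chat-hf" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
      "input": "[INST] Hello! [/INST]",
      "max_new_tokens": 512,
      "temperature": 0.7,
      "top_p": 0.9,
      "repetition_penalty": 1,
      "stop": ["</s>"],
      "stream": false
    }'

The remaining fields (top_k, num_responses, response_format, presence_penalty, frequency_penalty, webhook) are passed the same way; note that num_responses greater than 1 is incompatible with streaming.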

Input Schema

Output Schema