
OpenAI API

We offer an OpenAI-compatible API for all recent LLMs and all embedding models.

The APIs we support are:

  • chat completion, both streaming and regular, supported for all chat-tuned LLMs
  • completion, both streaming and regular, supported for all LLMs (chat-tuned or not)
  • embeddings, supported for all embedding models

The api_base is https://api.deepinfra.com/v1/openai.

Example with the recent Python client

pip install 'openai>=1.0.0'

from openai import OpenAI

client = OpenAI(
    api_key="<YOUR DEEPINFRA TOKEN: deepctl auth token, or get one from https://deepinfra.com/dash/api_keys>",
    base_url="https://api.deepinfra.com/v1/openai",
)

stream = True # or False

MODEL_DI = "meta-llama/Llama-2-70b-chat-hf"
chat_completion = client.chat.completions.create(
    model=MODEL_DI,
    messages=[{"role": "user", "content": "Hello world"}],
    stream=stream,
    max_tokens=100,
)

if stream:
    # print the response as it streams in, token by token
    for event in chat_completion:
        print(event.choices[0].delta.content or "", end="")
    print()
else:
    print(chat_completion.choices[0].message.content)
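
The plain completions endpoint (no chat) works the same way. A minimal sketch reusing the client and model from above; the prompt is just illustrative:

completion = client.completions.create(
    model=MODEL_DI,
    prompt="The capital of France is",
    max_tokens=16,
)
print(completion.choices[0].text)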

You can of course use regular HTTP:

export TOKEN="$(deepctl auth token)"
export URL_DI="https://api.deepinfra.com/v1/openai/chat/completions"
export MODEL_DI="meta-llama/Llama-2-70b-chat-hf"

curl "$URL_DI" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
      "stream": true,
      "model": "'$MODEL_DI'",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ],
      "max_tokens": 100
    }'

If you're already using OpenAI's chat completion endpoint, you only need to set the base_url and the API token and change the model name, and you're good to go.

Example with the legacy Python client

pip install 'openai<1.0.0'

import openai

stream = True # or False

# Point OpenAI client to our endpoint
openai.api_key = "<YOUR DEEPINFRA TOKEN: deepctl auth token>"
openai.api_base = "https://api.deepinfra.com/v1/openai"

MODEL_DI = "meta-llama/Llama-2-70b-chat-hf"
chat_completion = openai.ChatCompletion.create(
    model=MODEL_DI,
    messages=[{"role": "user", "content": "Hello world"}],
    stream=stream,
    max_tokens=100,
    # top_p=0.5,
)

if stream:
    # print the response as it streams in, token by token
    for event in chat_completion:
        print(event.choices[0].delta.get("content", ""), end="")
    print()
else:
    print(chat_completion.choices[0].message.content)

Model parameter

Some models have more than one version available; you can run inference against a particular version by using the {"model": "MODEL_NAME:VERSION", ...} format.

You can also run inference against a deploy_id by using {"model": "deploy_id:DEPLOY_ID", ...}. This is especially useful for Custom LLMs: you can run inference before the deployment is running (and before you have the model-name+version pair).
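
For example, all three forms below are valid values for the model parameter (VERSION and DEPLOY_ID are placeholders):

model = "meta-llama/Llama-2-70b-chat-hf"          # latest version
model = "meta-llama/Llama-2-70b-chat-hf:VERSION"  # a specific version
model = "deploy_id:DEPLOY_ID"                     # a specific deployment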

Caveats

Please note that we're not yet 100% compatible; drop us a line on Discord if you'd like us to prioritize something that's missing. Supported request attributes are listed below (an example sketch follows each list):

ChatCompletions and Completions:

  • model, including the version/deploy_id formats described above
  • messages (roles: system, user, assistant)
  • max_tokens
  • stream
  • temperature
  • top_p
  • stop
  • n
  • presence_penalty
  • frequency_penalty
  • response_format ({"type": "json"} only)
  • tools, tool_choice
  • echo, logprobs -- only for (non chat) completions
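
For example, a chat request combining several of these attributes could look like the sketch below (reusing the client from above; parameter values are illustrative):

chat_completion = client.chat.completions.create(
    model=MODEL_DI,
    messages=[
        {"role": "system", "content": "Answer in JSON."},
        {"role": "user", "content": "Name three colors."},
    ],
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
    stop=["\n\n"],
    response_format={"type": "json"},
)
print(chat_completion.choices[0].message.content)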

Embeddings:

  • model
  • input
  • encoding_format -- float only
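
A minimal embeddings sketch with the recent Python client; the model name is a placeholder for any embedding model we host:

embeddings = client.embeddings.create(
    model="<EMBEDDINGS MODEL NAME>",
    input=["Hello world"],
    encoding_format="float",
)
print(embeddings.data[0].embedding[:8])  # first 8 dimensions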