OpenAI API
We offer an OpenAI-compatible API for all of our LLM and embedding models.
The APIs we support are Chat Completions, Completions, and Embeddings.
The endpoint for the OpenAI APIs is https://api.deepinfra.com/v1/openai.
You can make plain HTTP requests, or use the official OpenAI Python and Node.js libraries. Streaming is supported in all cases.
For Python you should run
pip install openai
For JavaScript/Node.js you should run
npm install openai
The Chat Completions API is the easiest to use. You exchange messages and it just works. You can change the model to another LLM and it will continue working.
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

stream = True # or False

chat_completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=stream,
)

if stream:
    for event in chat_completion:
        if event.choices[0].finish_reason:
            print(event.choices[0].finish_reason,
                  event.usage.prompt_tokens,
                  event.usage.completion_tokens)
        else:
            print(event.choices[0].delta.content)
else:
    print(chat_completion.choices[0].message.content)
    print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: "$DEEPINFRA_TOKEN",
  baseURL: "https://api.deepinfra.com/v1/openai",
});

const stream = false; // or true

async function main() {
  const completion = await openai.chat.completions.create({
    messages: [{ role: "user", content: "Hello" }],
    model: "meta-llama/Meta-Llama-3-8B-Instruct",
    stream: stream,
  });

  if (stream) {
    for await (const chunk of completion) {
      if (chunk.choices[0].finish_reason) {
        console.log(chunk.choices[0].finish_reason,
          chunk.usage.prompt_tokens,
          chunk.usage.completion_tokens);
      } else {
        console.log(chunk.choices[0].delta.content);
      }
    }
  } else {
    console.log(completion.choices[0].message.content);
    console.log(completion.usage.prompt_tokens, completion.usage.completion_tokens);
  }
}

main();
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"stream": true,
"messages": [
{
"role": "user",
"content": "Hello!"
}
]
}'
You can see more complete examples at the documentation page of each model.
To carry on a longer chat-like conversation, you have to include every previous assistant response and every user message in each request. This way the model has the full context and can provide better answers. You can tweak its behavior even further by providing a system message.
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

stream = True # or False

chat_completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "Respond like a michelin starred chef."},
        {"role": "user", "content": "Can you name at least two different techniques to cook lamb?"},
        {"role": "assistant", "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'm more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\""},
        {"role": "user", "content": "Tell me more about the second method."},
    ],
    stream=stream,
)

if stream:
    for event in chat_completion:
        if event.choices[0].finish_reason:
            print(event.choices[0].finish_reason,
                  event.usage.prompt_tokens,
                  event.usage.completion_tokens)
        else:
            print(event.choices[0].delta.content)
else:
    print(chat_completion.choices[0].message.content)
    print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://api.deepinfra.com/v1/openai",
  apiKey: "$DEEPINFRA_TOKEN",
});

const stream = false; // or true

async function main() {
  const completion = await openai.chat.completions.create({
    messages: [
      { role: "system", content: "Respond like a michelin starred chef." },
      { role: "user", content: "Can you name at least two different techniques to cook lamb?" },
      { role: "assistant", content: "Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'm more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\"" },
      { role: "user", content: "Tell me more about the second method." },
    ],
    model: "meta-llama/Meta-Llama-3-8B-Instruct",
    stream: stream,
  });

  if (stream) {
    for await (const chunk of completion) {
      if (chunk.choices[0].finish_reason) {
        console.log(chunk.choices[0].finish_reason,
          chunk.usage.prompt_tokens,
          chunk.usage.completion_tokens);
      } else {
        console.log(chunk.choices[0].delta.content);
      }
    }
  } else {
    console.log(completion.choices[0].message.content);
    console.log(completion.usage.prompt_tokens, completion.usage.completion_tokens);
  }
}

main();
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"stream": true,
"messages": [
{
"role": "system",
"content": "Respond like a michelin starred chef."
},
{
"role": "user",
"content": "Can you name at least two different techniques to cook lamb?"
},
{
"role": "assistant",
"content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form, and I'"'"'m more than happy to share with you not two, but three of my favorite techniques to coax out the rich, unctuous flavors and tender textures of this majestic protein. First, we have the classic \"Sous Vide\" method. Next, we have the ancient art of \"Sous le Sable\". And finally, we have the more modern technique of \"Hot Smoking.\""
},
{
"role": "user",
"content": "Tell me more about the second method."
}
]
}'
The longer the conversation gets, the more time it takes the model to generate the response. The number of messages that you can have in a conversation is limited by the context size of a model. Larger models also usually take more time to respond and are more expensive.
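If a conversation grows toward the model's context size, a common approach is to drop the oldest exchanges before sending the next request. Below is a minimal sketch of that idea, assuming a rough character-based budget; the trim_history helper and MAX_HISTORY_CHARS limit are illustrative, not part of the API, and real limits are measured in tokens.

MAX_HISTORY_CHARS = 8000  # illustrative budget; actual limits are token-based and model-specific

def trim_history(messages):
    # Keep any system message, then drop the oldest user/assistant turns
    # until the remaining history fits the rough budget.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(len(m["content"]) for m in rest) > MAX_HISTORY_CHARS:
        rest.pop(0)
    return system + rest

# Example: trim before each request
# messages = trim_history(messages)
# chat_completion = openai.chat.completions.create(model=..., messages=messages)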
The Completions API is a more advanced API: you need to format the input yourself to make it work, and different models might have different input formats. The example below is for meta-llama/Meta-Llama-3-8B-Instruct. You can see the model's input format in the API section on its page.
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

stream = True # or False

completion = openai.completions.create(
    model='meta-llama/Meta-Llama-3-8B-Instruct',
    prompt='<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n',
    stop=['<|eot_id|>'],
    stream=stream,
)

if stream:
    for event in completion:
        if event.choices[0].finish_reason:
            print(event.choices[0].finish_reason,
                  event.usage.prompt_tokens,
                  event.usage.completion_tokens)
        else:
            print(event.choices[0].text)
else:
    print(completion.choices[0].text)
    print(completion.usage.prompt_tokens, completion.usage.completion_tokens)
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://api.deepinfra.com/v1/openai",
  apiKey: "$DEEPINFRA_TOKEN",
});

const stream = true; // or false

async function main() {
  const completion = await openai.completions.create({
    model: "meta-llama/Meta-Llama-3-8B-Instruct",
    prompt: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    stream: stream,
    stop: ["<|eot_id|>"],
  });

  if (stream) {
    for await (const chunk of completion) {
      if (chunk.choices[0].finish_reason) {
        console.log(chunk.choices[0].finish_reason,
          chunk.usage.prompt_tokens,
          chunk.usage.completion_tokens);
      } else {
        console.log(chunk.choices[0].text);
      }
    }
  } else {
    console.log(completion.choices[0].text);
    console.log(completion.usage.prompt_tokens, completion.usage.completion_tokens);
  }
}

main();
curl "https://api.deepinfra.com/v1/openai/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"stop": [
"<|eot_id|>"
]
}'
For every model you can check its input format in the API section on its page.
DeepInfra supports the OpenAI Embeddings API. The following creates an embedding vector representing the input text:
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

input = "The food was delicious and the waiter..." # or an array ["hello", "world"]

embeddings = openai.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=input,
    encoding_format="float",
)

if isinstance(input, str):
    print(embeddings.data[0].embedding)
else:
    for i in range(len(input)):
        print(embeddings.data[i].embedding)

print(embeddings.usage.prompt_tokens)
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "https://api.deepinfra.com/v1/openai",
  apiKey: "$DEEPINFRA_TOKEN",
});

const input = "The quick brown fox jumped over the lazy dog"; // or an array ["hello", "world"]

async function main() {
  const embedding = await openai.embeddings.create({
    model: "BAAI/bge-large-en-v1.5",
    input: input,
    encoding_format: "float",
  });

  // check if input is a string or array
  if (typeof input === "string") {
    console.log(embedding.data[0].embedding);
  } else {
    console.log(embedding.data.map((data) => data.embedding));
  }

  console.log(embedding.usage.prompt_tokens);
}

main();
curl "https://api.deepinfra.com/v1/openai/embeddings" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-d '{
"input": "The food was delicious and the waiter...",
"model": "BAAI/bge-large-en-v1.5",
"encoding_format": "float"
}'
Some models have more than one version available. You can infer against a particular version by specifying the {"model": "MODEL_NAME:VERSION", ...} format.
You can also infer against a deploy_id by using {"model": "deploy_id:DEPLOY_ID", ...}. This is especially useful for Custom LLMs: you can infer before the deployment is running (and before you have the model-name+version pair).
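As an illustration, here is one way to pass a versioned model name with the Python client. This is only a sketch: VERSION_HASH below is a made-up placeholder, and the real value comes from the versions listed on the model's page (or a deploy_id for your own deployment).

from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Sketch only: replace VERSION_HASH with a real version from the model's page,
# or use "deploy_id:DEPLOY_ID" to target a specific deployment.
chat_completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct:VERSION_HASH",
    messages=[{"role": "user", "content": "Hello"}],
)
print(chat_completion.choices[0].message.content)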
Please note that we might not be 100% compatible yet; let us know on Discord or by email if something you require is missing. Supported request attributes:
ChatCompletions and Completions:
- model, including specifying version/deploy_id support
- messages (roles system, user, assistant)
- max_tokens
- stream
- temperature
- top_p
- stop
- n
- presence_penalty
- frequency_penalty
- response_format ({"type": "json"} only, it will return the default format when omitted; see the sketch after this list)
- tools, tool_choice
- echo, logprobs -- only for (non chat) completions
- deploy_id might not be immediately available if the model is currently deploying
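To show how these attributes compose, here is a hedged sketch that requests JSON output via response_format along with a few sampling parameters. The prompt and parameter values are illustrative assumptions, and the exact JSON the model returns will vary.

from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Illustrative values only: the prompt, temperature, and max_tokens are arbitrary;
# response_format={"type": "json"} asks the model to emit JSON output.
chat_completion = openai.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "List three techniques to cook lamb as a JSON array of strings."}],
    response_format={"type": "json"},
    temperature=0.7,
    max_tokens=256,
)
print(chat_completion.choices[0].message.content)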
Embeddings:
- model
- input
- encoding_format -- float only

You can see even more details on each model's page.