Documentation
Inference
Contents
Inference
Simple, scalable and cost-effective inference API is the main feature of DeepInfra. We package state-of-the-art models into a simple rest API that you can use to build your applications.
There are multiple ways to access the API with different endpoints. You can choose the one that suits you best.
For LLMs there is the convenient OpenAI Chat Completions API, and the legacy OpenAI Completions API. Embedding models also support the OpenAI APIs.
These can be accessed at the following endpoint
https://api.deepinfra.com/v1/openai
This endpoint works with HTTP/Curl requests as well as with the official OpenAI libraries for Python & Node.js.
You can learn more here
Every model also has a dedicated inference endpoint.
https://api.deepinfra.com/v1/inference/{model_name}
for example, for meta-llama/Meta-Llama-3-8B-Instruct
the endpoint is
https://api.deepinfra.com/v1/inference/meta-llama/Meta-Llama-3-8B-Instruct
These endpoints can be accessed with REST requests as well as with the official DeepInfra Node.js library
However, bare in mind that for certain cases, like LLMs, this API is more advanced and harder to uses than the messaging OpenAI Chat Completions API.
All LLM models support streaming with all APIs and libraries, you just have to pass the stream
option.
You can see many examples in the API section of every model.
DeepInfra requires an API token to access any of its APIs. You can find yours in the dashboard
To authenticate your requests, you need to pass your API token in the
Authorization
header with type Bearer
.
Authorization: bearer $AUTH_TOKEN
or pass it as a parameter to the appropriate library.
Our inference API supports multipart/form-data
and application/json
content types.
We strongly suggest to use the latter whenever possible.
Using multipart/form-data
makes sense when you want to send binary data
such as media files. Using this content type requires less bandwidth and
is more efficient for large files.
Using application/json
makes sense when you want to send text data.
You can also use this content type for binary data, using data urls.
For example:
{
"image": "..."
}
We use standard HTTP status codes to indicate the status of the request.
200
- OK. The request was successful.4xx
- Bad Request. The request was invalid or cannot be served.5xx
- Internal Server Error. Something went wrong on our side.The response body is always a JSON object containing the model output.
It also contains metadata about the inference request like request_id
, cost
, runtime_ms
(except for LLMs), tokens_input
, tokens_generated
(LLMs only).
Example response:
{
"request_id": "RfMWDr1NXCd7cnaegcm3A8q0",
"inference_status": {
"cost": 0.004639499820768833,
"runtime_ms": 1285,
"status": "succeeded"
},
"text": "Hello World"
}