Inference
A simple, scalable and cost-effective inference API is the main feature of DeepInfra. We package state-of-the-art models into a simple REST API that you can use to build your applications.
There are multiple ways to access the API with different endpoints. You can choose the one that suits you best.
For LLMs, there is the convenient OpenAI Chat Completions API and the legacy OpenAI Completions API. Embedding models also support the OpenAI APIs.
These can be accessed at the following endpoint:
https://api.deepinfra.com/v1/openai
This endpoint works with plain HTTP/curl requests as well as with the official OpenAI libraries for Python & Node.js.
You can learn more here.
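For example, here is a minimal Python sketch using the official OpenAI client pointed at this endpoint (the environment variable name is just an example; use your own token):

import os
from openai import OpenAI

# Point the official OpenAI Python client (openai >= 1.0) at DeepInfra's
# OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],  # example variable name
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)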
Every model also has a dedicated inference endpoint.
https://api.deepinfra.com/v1/inference/{model_name}
For example, for meta-llama/Meta-Llama-3-8B-Instruct the endpoint is
https://api.deepinfra.com/v1/inference/meta-llama/Meta-Llama-3-8B-Instruct
These endpoints can be accessed with REST requests as well as with the official DeepInfra Node.js library.
However, bear in mind that for certain cases, like LLMs, this API is more advanced and harder to use than the message-based OpenAI Chat Completions API.
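As a rough sketch, assuming the prompt goes into an input field (the exact request schema is listed in each model's API section), a plain HTTP call with Python looks like this:

import os
import requests

# Sketch only: the request/response fields differ per model, check the
# model's API section for its actual schema.
resp = requests.post(
    "https://api.deepinfra.com/v1/inference/meta-llama/Meta-Llama-3-8B-Instruct",
    headers={"Authorization": f"bearer {os.environ['DEEPINFRA_API_TOKEN']}"},
    json={"input": "Hello!"},
)
resp.raise_for_status()
print(resp.json())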
All LLM models support streaming with all APIs and libraries; you just have to pass the stream option.
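For instance, with the OpenAI-compatible endpoint and the Python client, streaming is a matter of setting stream=True (a sketch; the environment variable name is an example):

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],  # example variable name
    base_url="https://api.deepinfra.com/v1/openai",
)

# Pass stream=True and iterate over the chunks as they arrive.
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)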
You can see many examples in the API section of every model.
DeepInfra requires an API token to access any of its APIs. You can find yours in the dashboard.
To authenticate your requests, you need to pass your API token in the Authorization header with type Bearer.
Authorization: bearer $AUTH_TOKEN
Alternatively, you can pass it as a parameter to the appropriate library.
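Both styles in Python, as a sketch (the environment variable name is only an example):

import os
from openai import OpenAI

token = os.environ["DEEPINFRA_API_TOKEN"]  # example variable name

# Raw HTTP: set the Authorization header yourself (e.g. with requests or curl).
headers = {"Authorization": f"bearer {token}"}

# Official OpenAI client: pass the token as the api_key parameter.
client = OpenAI(api_key=token, base_url="https://api.deepinfra.com/v1/openai")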
Our inference API supports the multipart/form-data and application/json content types.
We strongly suggest using the latter whenever possible.
Using multipart/form-data makes sense when you want to send binary data such as media files. Using this content type requires less bandwidth and is more efficient for large files.
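For example, here is a hedged sketch of a multipart upload with Python requests; both the model and the audio field name below are illustrative placeholders, and each model's API section lists its real input fields:

import os
import requests

# requests sends multipart/form-data automatically when files= is used.
# The model name and the "audio" field below are placeholders.
with open("speech.mp3", "rb") as f:
    resp = requests.post(
        "https://api.deepinfra.com/v1/inference/openai/whisper-large",
        headers={"Authorization": f"bearer {os.environ['DEEPINFRA_API_TOKEN']}"},
        files={"audio": f},
    )
print(resp.json())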
Using application/json makes sense when you want to send text data.
You can also use this content type for binary data, using data URLs.
For example:
{
"image": "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBD..."
}
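A data URL like the one above can be built from a local file, for instance (a sketch; whether the field is called image depends on the model):

import base64

# Encode a local file as a data URL for an application/json request body.
with open("photo.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

payload = {"image": f"data:image/jpeg;base64,{encoded}"}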
We use standard HTTP status codes to indicate the status of the request.
200 - OK. The request was successful.
4xx - Bad Request. The request was invalid or cannot be served.
5xx - Internal Server Error. Something went wrong on our side.
The response body is always a JSON object containing the model output.
It also contains metadata about the inference request, like request_id, cost, runtime_ms (except for LLMs), tokens_input and tokens_generated (LLMs only).
Example response:
{
"request_id": "RfMWDr1NXCd7cnaegcm3A8q0",
"inference_status": {
"cost": 0.004639499820768833,
"runtime_ms": 1285,
"status": "succeeded"
},
"text": "Hello World"
}
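Pulling the output and the metadata out of such a body might look like this (a sketch; the output field, text here, varies per model):

import json

# Parse the example response body above and read the inference metadata.
body = json.loads("""
{
  "request_id": "RfMWDr1NXCd7cnaegcm3A8q0",
  "inference_status": {
    "cost": 0.004639499820768833,
    "runtime_ms": 1285,
    "status": "succeeded"
  },
  "text": "Hello World"
}
""")

if body["inference_status"]["status"] == "succeeded":
    print(body["text"])
    print("cost:", body["inference_status"]["cost"])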