Documentation

Getting Started

You don't need to install anything to run your first inference. You only need your access token.

Go to the API section on any model's page. Grab one of the examples. If you are logged in, your access token will be prefilled for you.

You can try one of the examples from meta-llama/Meta-Llama-3-8B-Instruct

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "messages": [
        {
          "role": "user",
          "content": "Hello!"
        }
      ]
    }'

and it will respond with something like

{
    "id": "chatcmpl-guMTxWgpFf",
    "object": "chat.completion",
    "created": 1694623155,
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": " Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 15,
        "completion_tokens": 16,
        "total_tokens": 31,
        "estimated_cost": 0.0000268
    }
}

This example uses the OpenAI Chat Completions API, which we strongly recommend because it is the most convenient way to work with LLMs. You can also use it with the official OpenAI JavaScript/Node.js and Python libraries, and they will work out of the box.
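
For example, you can make the same request with the official OpenAI Python library. Below is a minimal sketch; it assumes DEEPINFRA_TOKEN is set in your environment, and the base_url is the same OpenAI-compatible endpoint as in the curl example above

import os

from openai import OpenAI

# Point the official OpenAI client at DeepInfra's OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.environ["DEEPINFRA_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)

Streaming (passing stream=True to the same call) works here too, just as it does with OpenAI's own API.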

If you want to dip your toes a little deeper into the AI world, you can try the following example

curl "https://api.deepinfra.com/v1/inference/meta-llama/Meta-Llama-3-8B-Instruct" \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
   -d '{
     "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
     "stop": [
       "<|eot_id|>"
     ]
   }'

This example uses DeepInfra's native API, and it requires more advanced knowledge of how the model works, which in turn gives you more flexibility. You can read the specifics of each model in its API section, including stop words, streaming, and more.
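
If you would rather call this endpoint from code, here is a minimal sketch using the Python requests library (an assumption; any HTTP client will do), sending the same raw prompt and stop sequence as the curl command above

import os

import requests

# Llama 3 chat template with a single user turn, as in the curl example above
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

response = requests.post(
    "https://api.deepinfra.com/v1/inference/meta-llama/Meta-Llama-3-8B-Instruct",
    headers={"Authorization": f"Bearer {os.environ['DEEPINFRA_TOKEN']}"},
    json={"input": prompt, "stop": ["<|eot_id|>"]},
)

print(response.json()["results"][0]["generated_text"])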

Either way, you will get a response similar to the one in the previous example

{
    "request_id": "RWZDRhS5kdoM1XWwXLEshynO",
    "inference_status": {
        "status": "succeeded",
        "runtime_ms": 243,
        "cost": 0.0000436,
        "tokens_input": 12,
        "tokens_generated": 25
    },
    "results": [
        {
            "generated_text": "Hello! It's nice to meet you. Is there something I can help you with or would you like to chat for a bit?"
        }
    ],
    "num_tokens": 25,
    "num_input_tokens":12
}