
Deploy Custom LLMs on DeepInfra
Published on 2024.03.01 by Iskren Chernev

Did you just finetune your favorite model, and are you wondering where to run it? We have you covered: a simple API and predictable pricing.

Put your model on Hugging Face

A private repo is fine if you prefer, we don't mind. For better security, create a Hugging Face access token scoped to just that repo.

Create custom deployment

Via Web

You can use the Web UI to create a new deployment.

Custom LLM Web UI

Via HTTP

We also offer HTTP API:

curl -X POST https://api.deepinfra.com/deploy/llm -d '{
    "model_name": "test-model",
    "gpu": "A100-80GB",
    "num_gpus": 2,
    "max_batch_size": 64,
    "hf": {
        "repo": "meta-llama/Llama-2-7b-chat-hf"
    },
    "settings": {
        "min_instances": 1,
        "max_instances": 1
    }
}' -H 'Content-Type: application/json' \
    -H "Authorization: Bearer YOUR_API_KEY"
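If you prefer to script the deployment, the same request can be made from Python's standard library. This is a minimal sketch; the endpoint and fields mirror the curl example above, and YOUR_API_KEY is a placeholder for your actual key:

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder, as in the curl example

def build_deploy_payload(model_name, gpu, num_gpus, repo, max_batch_size=64):
    """Assemble the JSON body for POST /deploy/llm (same fields as the curl example)."""
    return {
        "model_name": model_name,
        "gpu": gpu,
        "num_gpus": num_gpus,
        "max_batch_size": max_batch_size,
        "hf": {"repo": repo},
        "settings": {"min_instances": 1, "max_instances": 1},
    }

def deploy(payload):
    """POST the payload to the deploy endpoint and return the parsed response."""
    req = urllib.request.Request(
        "https://api.deepinfra.com/deploy/llm",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_deploy_payload("test-model", "A100-80GB", 2, "meta-llama/Llama-2-7b-chat-hf")
# deploy(payload)  # uncomment once API_KEY holds a real key
```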

Use it

curl -X POST \
    -d '{"input": "Hello"}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer YOUR_API_KEY" \
    'https://api.deepinfra.com/v1/inference/github-username/di-model-name'
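The inference call above can likewise be scripted. A minimal sketch with Python's standard library, where the deployment name (`github-username/di-model-name` in the curl example) and YOUR_API_KEY are placeholders:

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder

def build_inference_request(deploy_name: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for your deployment's inference endpoint.
    deploy_name is the `<github-username>/<di-model-name>` pair from the deploy step."""
    return urllib.request.Request(
        f"https://api.deepinfra.com/v1/inference/{deploy_name}",
        data=json.dumps({"input": prompt}).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )

# To actually run it against your deployment:
# with urllib.request.urlopen(build_inference_request("github-username/di-model-name", "Hello")) as resp:
#     print(json.load(resp))
```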

For an in-depth tutorial, check the Custom LLM Docs.

Related articles

Langchain improvements: async and streaming: Starting from langchain v0.0.322 you can make efficient async generation and streaming tokens with deepinfra…

Guaranteed JSON output on Open-Source LLMs: DeepInfra is proud to announce that we have released "JSON mode" across all of our text language models. It is available through the "response_format" object, which currently supports only {"type": "json_object"}…

Inference LoRA adapter model: Learn how to inference a LoRA adapter model.