Custom LLMs
You can now run a dedicated instance of your public or private LLM on DeepInfra infrastructure.
There are a number of benefits to running your own custom LLM instance:
There are, of course, some drawbacks:
It's important to understand that all our publicly available models, like Mixtral 8x7B, are shared among many users, which is what lets us offer very competitive pricing. When you run your own model, you get full access to the GPUs and pay per GPU-hour that your model is up, so you need sufficient load to justify this resource.
A deployment is a particular configuration of your custom model. It has the following fixed parameters:
model_name -- the name you'd use when doing inference (generation)
gpu -- the GPU type; A100-80GB or H100-80GB are supported now, expect more in the future
num_gpus -- how many GPUs to use; bigger models require more GPUs (it should at least fit the weights and have some leftover for KV cache)
max_batch_size -- how many requests to run in parallel (at most); other requests are queued up
It also has a few settings that can be changed dynamically:
min_instances -- how many copies of the model to run at a minimum
max_instances -- up to how many copies to scale during times of higher load
To create a new deployment, use the Web UI:
Or, using the HTTP API:
curl -X POST https://api.deepinfra.com/deploy/llm -d '{
    "model_name": "test-model",
    "gpu": "A100-80GB",
    "num_gpus": 2,
    "max_batch_size": 64,
    "hf": {
        "repo": "meta-llama/Llama-2-7b-chat-hf"
    },
    "settings": {
        "min_instances": 0,
        "max_instances": 1
    }
}' -H 'Content-Type: application/json' \
   -H "Authorization: Bearer $DEEPINFRA_TOKEN"
The deployment can be monitored via HTTP or the Web dashboard.
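For example, here is a minimal sketch of polling a deployment over HTTP. It assumes the /deploy/DEPLOY_ID path (used for updates below) also responds to GET with the deploy object; DEPLOY_ID is returned when the deployment is created.
# Assumption: GET on the same /deploy/DEPLOY_ID path used for updates returns the deploy object.
curl https://api.deepinfra.com/deploy/DEPLOY_ID \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN"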
Please note that the model's full name is github-username/model-name: the name you specify is prefixed by your github username. So if I (ichernev) create a model test-model, its full name will be ichernev/test-model. You can then use this name during inference, or check the model web page.
You can use your model via:
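As a sketch, assuming the OpenAI-compatible chat completions endpoint that DeepInfra exposes for hosted models, an inference request against the deployed model could look like this, with the full model name in the model field (the prompt is illustrative):
# A minimal sketch, assuming the OpenAI-compatible chat completions endpoint.
curl -X POST https://api.deepinfra.com/v1/openai/chat/completions \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "ichernev/test-model",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'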
Once a deployment is running, its scaling parameters can be updated via the deployment details page accessible from Dashboard / Deployments.
Or via HTTP:
curl -X PUT https://api.deepinfra.com/deploy/DEPLOY_ID -d '{
    "settings": {
        "min_instances": 2,
        "max_instances": 2
    }
}' -H 'Content-Type: application/json' \
   -H "Authorization: Bearer $(deepctl auth token)"
You'd need your DEPLOY_ID. It is returned on creation, but is also available in the Web Dashboard, via deepctl deploy list, or via the HTTP API /deploy/list.
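As an illustration, you could list your deployments over HTTP and extract the IDs with jq; the exact shape of the response (and the id field name) is an assumption here:
# Assumption: /deploy/list returns an array of deploy objects, each with an "id" field.
curl https://api.deepinfra.com/deploy/list \
    -H "Authorization: Bearer $(deepctl auth token)" | jq '.[].id'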
When you want to permanently delete / discard a deployment, use:
deepctl deploy delete DEPLOY_ID
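If you prefer the HTTP API, a sketch of the equivalent call, assuming the deployment path accepts DELETE in the same way it accepts PUT for updates:
# Assumption: the /deploy/DEPLOY_ID path accepts DELETE, mirroring the PUT used for updates.
curl -X DELETE https://api.deepinfra.com/deploy/DEPLOY_ID \
    -H "Authorization: Bearer $(deepctl auth token)"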
Sometimes, due to limited GPU availability, fewer instances might be running than requested (for example, min_instances == 3, but we can only run 2). You're only billed for what actually runs. The current number of running instances is returned in the deploy object.