Custom LLMs
You can now run a dedicated instance of your public or private LLM on DeepInfra infrastructure.
There are a number of benefits to running your own custom LLM instance, as well as some drawbacks.
It's important to understand that all our publicly available models, like Mixtral 8x7B, are shared among many users, which lets us offer very competitive pricing. When you run your own model, you get full access to the GPUs and pay per GPU-hour your model is up, so you need sufficient load to justify this resource.
A deployment is a particular configuration of your custom model. It has the following fixed parameters:

model_name -- the name you'd use when doing inference (generation)
gpu -- the GPU type; A100-80GB or H100-80GB supported now, expect more in the future
num_gpus -- how many GPUs to use; bigger models require more GPUs (it should at least fit the weights and have some left over for KV cache)
max_batch_size -- how many requests to run in parallel (at most); other requests are queued up

It also has a few settings that can be changed dynamically:

min_instances -- how many copies of the model to run at a minimum
max_instances -- up to how many copies to scale during times of higher load

To create a new deployment, use the Web UI:
Or, using the HTTP API:
curl -X POST https://api.deepinfra.com/deploy/llm \
    -d '{
        "model_name": "test-model",
        "gpu": "A100-80GB",
        "num_gpus": 2,
        "max_batch_size": 64,
        "hf": {
            "repo": "meta-llama/Llama-2-7b-chat-hf"
        },
        "settings": {
            "min_instances": 0,
            "max_instances": 1
        }
    }' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN"
The deployment can be monitored via HTTP or the Web dashboard.
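For example, a quick way to poll the status from the command line; this is a sketch that assumes the deployment resource can be read back with a GET on the same /deploy/DEPLOY_ID path used for updates below:

# Sketch: fetch the current state of a deployment (DEPLOY_ID is returned on creation)
curl https://api.deepinfra.com/deploy/DEPLOY_ID \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN"

The returned deploy object includes the dynamic settings and the current number of running instances.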
Please note that the full model name is github-username/model-name: the name you specify is prefixed by your github username. So if I (ichernev) create a model named test-model, its full name will be ichernev/test-model. You can then use this name during inference, or check the model web page.
You can use your model via:
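For example, here is a minimal sketch of an inference call over HTTP; it assumes the deployment is exposed through DeepInfra's OpenAI-compatible chat completions endpoint and is referenced by its full name (ichernev/test-model in this example):

# Sketch: chat completion against the custom deployment (full model name assumed)
curl https://api.deepinfra.com/v1/openai/chat/completions \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "ichernev/test-model",
        "messages": [{"role": "user", "content": "Hello!"}]
    }'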
Once a deployment is running, its scaling parameters can be updated via the deployment details page accessible from Dashboard / Deployments.
Or, via HTTP:
curl -X PUT https://api.deepinfra.com/deploy/DEPLOY_ID \
    -d '{
        "settings": {
            "min_instances": 2,
            "max_instances": 2
        }
    }' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN"
You'd need your DEPLOY_ID. It is returned on creation, but is also available in the Web Dashboard or via the HTTP API at /deploy/list.
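For example, a sketch of looking it up over HTTP, assuming /deploy/list accepts a plain GET with the same bearer token:

# Sketch: list all deployments and their ids
curl https://api.deepinfra.com/deploy/list \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN"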
Note that sometimes we cannot run the requested number of instances (e.g. min_instances == 3, but we can only run 2). You're only billed for what actually runs. The current number of running instances is returned in the deploy object.

When you want to permanently delete / discard a deployment, use:
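A sketch of what that could look like over HTTP, assuming deletion follows the same /deploy/DEPLOY_ID resource pattern with the DELETE method (check the API reference or the Web dashboard for the exact call):

# Sketch: permanently remove a deployment (endpoint assumed from the update example above)
curl -X DELETE https://api.deepinfra.com/deploy/DEPLOY_ID \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN"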