We use essential cookies to make our site work. With your consent, we may also use non-essential cookies to improve user experience and analyze website traffic…

GLM-5.1 - state-of-the-art agentic engineering, now available on DeepInfra!

Inference LoRA adapter model
Published on 2024.12.06 by Askar Aitzhan
Inference LoRA adapter model

Understanding LoRA inference

Concepts

  • Base model: The original model that is used as a starting point.
  • LoRA adapter model: A small model that is used to adapt the base model for a specific task.
  • LoRA Rank: The rank of the matrix that is used to adapt the model.

What you need to inference with LoRA adapter model

  1. Supported base model
  2. LoRA adapter model hosted on HuggingFace
  3. HuggingFace token if the LoRA adapter model is private
  4. DeepInfra account

How to inference with LoRA adapter in DeepInfra

  1. Go to the dashboard
  2. Click on the 'New Deployment' button
  3. Click on the 'LoRA Model' tab
  4. Fill the form:
    • LoRA model name: model name used to reference the deployment
    • Hugging Face Model Name: Hugging Face model name
    • Hugging Face Token: (optional) Hugging Face token if the LoRA adapter model is private
  5. Click on the 'Upload' button

Note: The list of supported base models is listed on the same page. If you need a base model that is not listed, please contact us at feedback@deepinfra.com

Rate limits on LoRA adapter model

Rate limit will apply on combined traffic of all LoRA adapter models with the same base model. For example, if you have 2 LoRA adapter models with the same base model, and have rate limit of 200. Those 2 LoRA adapter models combined will have rate limit of 200.

Pricing on LoRA adapter model

Pricing is 50% higher than base model.

How is LoRA adapter model speed compared to base model speed?

LoRA adapter model speed is lower than base model, because there is additional compute and memory overhead to apply the LoRA adapter. From our benchmarks, the LoRA adapter model speed is about 50-60% slower than base model.

How to make LoRA adapter model faster?

You could merge the LoRA adapter with the base model to reduce the overhead. And use custom deployment, the speed will be close to the base model.

Related articles
GLM-4.6 API: Get fast first tokens at the best $/M from Deepinfra's API - Deep InfraGLM-4.6 API: Get fast first tokens at the best $/M from Deepinfra's API - Deep Infra<p>GLM-4.6 is a high-capacity, “reasoning”-tuned model that shows up in coding copilots, long-context RAG, and multi-tool agent loops. With this class of workload, provider infrastructure determines perceived speed (first-token time), tail stability, and your unit economics. Using ArtificialAnalysis (AA) provider charts for GLM-4.6 (Reasoning), DeepInfra (FP8) pairs a sub-second Time-to-First-Token (TTFT) (0.51 s) with the [&hellip;]</p>
Function Calling in DeepInfra: Extend Your AI with Real-World LogicFunction Calling in DeepInfra: Extend Your AI with Real-World Logic<p>Modern large language models (LLMs) are incredibly powerful at understanding and generating text, but until recently they were largely static: they could only respond based on patterns in their training data. Function calling changes that. It lets language models interact with external logic — your own code, APIs, utilities, or business systems — while still [&hellip;]</p>
Qwen3.5 9B API Benchmarks: Latency, Throughput & CostQwen3.5 9B API Benchmarks: Latency, Throughput & Cost<p>About Qwen3.5 9B Qwen3.5 9B is the flagship of Alibaba&#8217;s Qwen3.5 Small Model Series, released on March 2, 2026. It is a dense multimodal model combining Gated Delta Networks (a form of linear attention) with a sparse Mixture-of-Experts system, enabling higher throughput and lower latency during inference compared to traditional dense architectures. The architecture utilizes [&hellip;]</p>