Note: The supported base models are listed on the same page. If you need a base model that is not listed, please contact us at feedback@deepinfra.com.
Rate limits apply to the combined traffic of all LoRA adapter models that share the same base model. For example, if you have 2 LoRA adapter models on the same base model and a rate limit of 200, the 2 adapter models together are limited to 200.
Pricing is 50% higher than the base model.
LoRA adapter models are slower than the base model because applying the adapter adds compute and memory overhead. In our benchmarks, LoRA adapter models are about 50-60% slower than the base model.
To reduce this overhead, you can merge the LoRA adapter into the base model and serve the result as a custom deployment; the speed will then be close to that of the base model.
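
To illustrate why merging removes the per-request overhead, here is a minimal sketch that assumes the standard LoRA formulation, in which the adapter adds a low-rank update `(alpha / r) * B @ A` to a base weight `W`. The tiny matrices, the `alpha` and `r` values, and the helper functions are all made up for illustration; this is not DeepInfra's serving code. Unmerged inference does the base matrix multiply plus two extra adapter multiplies on every request, while merged inference folds the update into the weights once and then does a single multiply.

```python
# Illustrative only: tiny hand-written matrices stand in for real model weights.
# Assumes the standard LoRA update W' = W + (alpha / r) * B @ A.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

# Hypothetical base weight W (2x3) and LoRA factors B (2x1), A (1x3), rank r = 1.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[0.5], [-1.0]]
A = [[1.0, 0.0, 2.0]]
alpha, r = 2.0, 1
scaling = alpha / r

x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Unmerged: base multiply plus two extra adapter multiplies on every request.
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the adapter into the weights once, then one multiply per request.
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(unmerged)  # [[21.0], [18.0]]
print(merged)    # [[21.0], [18.0]]
```

Both paths produce the same output, so the merged weights can be served like an ordinary model without the adapter step. In practice a library usually handles the merge; for example, Hugging Face PEFT provides `merge_and_unload()` for LoRA models, though whether your adapter was trained with PEFT is an assumption to check.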
© 2026 Deep Infra. All rights reserved.