
# create a virtual environment
python3 -m venv .venv
# activate environment in current shell
. .venv/bin/activate
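# install openai python client (the module-level API used below,
# openai.api_base and openai.ChatCompletion, exists only in releases before 1.0)
pip install "openai<1.0"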
import openai
stream = True # or False
# Point OpenAI client to our endpoint
openai.api_key = "<YOUR DEEPINFRA API KEY>"
openai.api_base = "https://api.deepinfra.com/v1/openai"
# Your chosen model here
MODEL_DI = "meta-llama/Llama-2-70b-chat-hf"
chat_completion = openai.ChatCompletion.create(
    model=MODEL_DI,
    messages=[{"role": "user", "content": "Hello world"}],
    stream=stream,
    max_tokens=100,
    # top_p=0.5,
)
if stream:
    # print the chat completion
    for event in chat_completion:
        print(event.choices)
else:
    print(chat_completion.choices[0].message.content)
Note that both streaming and batch (non-streaming) modes are supported.
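In streaming mode the reply arrives as a sequence of small events rather than one message. As a minimal sketch, the helper below joins the text fragments back into a single string. It assumes each event is dict-like with a `choices[0]["delta"]` entry that may carry a `content` key, which is the shape the pre-1.0 `openai` client yields; the name `collect_stream_text` is hypothetical and not part of any library.

```python
def collect_stream_text(events):
    """Join the text fragments of a streamed chat completion.

    Assumes dict-like events shaped like {"choices": [{"delta": {...}}]}.
    """
    parts = []
    for event in events:
        choices = event.get("choices") or []
        if not choices:
            continue  # skip events that carry no choices
        delta = choices[0].get("delta") or {}
        content = delta.get("content")
        if content:  # role-only and final events have no content
            parts.append(content)
    return "".join(parts)
```

With `stream = True`, the loop over `chat_completion` could then be replaced by `print(collect_stream_text(chat_completion))`.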
If you're already using OpenAI chat completion in your project, you need to
change the api_key, api_base and model params:
import openai
# set these before running any completions
openai.api_key = "YOUR DEEPINFRA TOKEN"
openai.api_base = "https://api.deepinfra.com/v1/openai"
openai.ChatCompletion.create(
    model="CHOSEN MODEL HERE",
    # ...
)
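Releases 1.0 and later of the `openai` package replace the module-level settings with a client object, so the same three values are passed differently there. The sketch below assumes such a release is installed; the helper names `make_deepinfra_client` and `ask` are hypothetical, while the endpoint URL is the one given above.

```python
def make_deepinfra_client(api_key):
    # Imported lazily so this module loads even without openai>=1.0 installed.
    from openai import OpenAI

    # base_url takes the place of openai.api_base from the older client.
    return OpenAI(
        api_key=api_key,
        base_url="https://api.deepinfra.com/v1/openai",
    )


def ask(client, model, prompt, max_tokens=100):
    """Send one user message and return the reply text (non-streaming)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content
```

A call would then look like `ask(make_deepinfra_client("<YOUR DEEPINFRA API KEY>"), "CHOSEN MODEL HERE", "Hello world")`.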
Our OpenAI API compatible models are priced on token output (just like OpenAI). Our current price is $1 / 1M tokens.
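As a rough illustration of that pricing, the sketch below estimates the cost of a request from its output token count. It assumes the quoted rate of $1 per 1M tokens applies to output tokens only; the function name and the constant are hypothetical, and current rates should be checked before relying on the numbers.

```python
PRICE_PER_MILLION_TOKENS_USD = 1.0  # rate quoted above; may change


def estimate_cost_usd(output_tokens, price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Estimate the cost in USD for a given number of output tokens."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return output_tokens * price_per_million / 1_000_000


# A request capped at max_tokens=100 costs at most this much:
print(estimate_cost_usd(100))
```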
Check the OpenAI API docs for more in-depth information and examples.
© 2026 Deep Infra. All rights reserved.