Qwen3-Max-Thinking state-of-the-art reasoning model at your fingertips!

Starting from langchain v0.0.322 you can make efficient async generation and streaming tokens with deepinfra.
The deepinfra wrapper now supports native async calls, so you can expect more performance (no more threads per invocation) from your async pipelines.
from langchain.llms.deepinfra import DeepInfra
async def async_predict():
llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
output = await llm.apredict("What is 2 + 2?")
print(output)
Streaming lets you receive each token of the response as it gets generated. This is indispensable in user-facing applications.
def streaming():
llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
for chunk in llm.stream("[INST] Hello [/INST] "):
print(chunk, end='', flush=True)
print()
You can also use the asynchronous streaming API, natively implemented underneath.
async def async_streaming():
llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
async for chunk in llm.astream("[INST] Hello [/INST] "):
print(chunk, end='', flush=True)
print()
Getting StartedGetting an API Key
To use DeepInfra's services, you'll need an API key. You can get one by signing up on our platform.
Sign up or log in to your DeepInfra account at deepinfra.com
Navigate to the Dashboard and select API Keys
Create a new ...
NVIDIA Nemotron API Pricing Guide 2026<p>While everyone knows Llama 3 and Qwen, a quieter revolution has been happening in NVIDIA’s labs. They have been taking standard Llama models and “supercharging” them using advanced alignment techniques and pruning methods. The result is Nemotron—a family of models that frequently tops the “Helpfulness” leaderboards (like Arena Hard), often beating GPT-4o while being significantly […]</p>
Kimi K2 0905 API from Deepinfra: Practical Speed, Predictable Costs, Built for Devs - Deep Infra<p>Kimi K2 0905 is Moonshot’s long-context Mixture-of-Experts update designed for agentic and coding workflows. With a context window up to ~256K tokens, it can ingest large codebases, multi-file documents, or long conversations and still deliver structured, high-quality outputs. But real-world performance isn’t defined by the model alone—it’s determined by the inference provider that serves it: […]</p>
© 2026 Deep Infra. All rights reserved.