DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

Starting from langchain v0.0.322 you can make efficient async generation and streaming tokens with deepinfra.
The deepinfra wrapper now supports native async calls, so you can expect more performance (no more threads per invocation) from your async pipelines.
from langchain.llms.deepinfra import DeepInfra
async def async_predict():
llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
output = await llm.apredict("What is 2 + 2?")
print(output)
Streaming lets you receive each token of the response as it gets generated. This is indispensable in user-facing applications.
def streaming():
llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
for chunk in llm.stream("[INST] Hello [/INST] "):
print(chunk, end='', flush=True)
print()
You can also use the asynchronous streaming API, natively implemented underneath.
async def async_streaming():
llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
async for chunk in llm.astream("[INST] Hello [/INST] "):
print(chunk, end='', flush=True)
print()
DeepSeek V4 Pro: Model Overview, Features & Performance Guide<p>DeepSeek V4 Pro is a 1.6-trillion parameter Mixture-of-Experts (MoE) model from DeepSeek, released on April 24, 2026 under the MIT license. It is designed for advanced reasoning, complex software engineering, and long-running agentic tasks, and arrives alongside DeepSeek-V4-Flash, a lighter 284B-parameter variant built for faster, lower-cost inference. The V4 series is DeepSeek’s first two-tier lineup […]</p>
Qwen API Pricing Guide 2026: Max Performance on a Budget<p>If you have been following the AI leaderboards lately, you have likely noticed a new name constantly trading blows with GPT-4o and Claude 3.5 Sonnet: Qwen. Developed by Alibaba Cloud, the Qwen model family (specifically Qwen 2.5 and Qwen 3) has exploded in popularity for one simple reason: unbeatable price-to-performance. In 2025, Qwen is widely […]</p>
NVIDIA Nemotron 3 Super 120B API Benchmarks: Latency & Cost<p>About NVIDIA Nemotron 3 Super 120B A12B NVIDIA’s Nemotron 3 Super 120B A12B is an open-weight large language model released on March 11, 2026. It features 120B total parameters with only 12B active per forward pass, delivering exceptional compute efficiency for complex multi-agent applications such as software development and cybersecurity triaging. The model uses a […]</p>
© 2026 DeepInfra. All rights reserved.