
Starting with langchain v0.0.322, you can run efficient async generation and stream tokens with DeepInfra.
The DeepInfra wrapper now supports native async calls, so async pipelines no longer spawn a thread per invocation and should perform better.
from langchain.llms.deepinfra import DeepInfra

async def async_predict():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    output = await llm.apredict("What is 2 + 2?")
    print(output)
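Native async calls pay off most when several requests are in flight at once. The sketch below is one way to fan out prompts with `asyncio.gather`; it is an illustration, not part of the DeepInfra wrapper. `predict_many` and `EchoLLM` are hypothetical names: `EchoLLM` is an offline stand-in that only mimics the `apredict` method used above, so the example runs without an API key.

```python
import asyncio


async def predict_many(llm, prompts):
    """Await llm.apredict for every prompt concurrently; results keep prompt order."""
    return await asyncio.gather(*(llm.apredict(p) for p in prompts))


class EchoLLM:
    """Hypothetical stand-in for DeepInfra so the sketch runs offline."""

    async def apredict(self, prompt):
        await asyncio.sleep(0.01)  # simulates network latency
        return f"echo: {prompt}"


if __name__ == "__main__":
    # Swap EchoLLM() for a real DeepInfra(...) instance in an actual pipeline.
    outputs = asyncio.run(predict_many(EchoLLM(), ["What is 2 + 2?", "What is 3 + 3?"]))
    print(outputs)
```

Because the calls are awaited together, total latency should be closer to the slowest request than to the sum of all of them.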
Streaming lets you receive each token of the response as it gets generated. This is indispensable in user-facing applications.
def streaming():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    for chunk in llm.stream("[INST] Hello [/INST] "):
        print(chunk, end='', flush=True)
    print()
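A user-facing application usually needs the full response as well, for example to store it in chat history. The following is a minimal sketch of a helper that prints chunks as they arrive and returns the joined text. `print_and_collect` and `fake_stream` are hypothetical names; `fake_stream` stands in for `llm.stream(...)` and assumes, as in the loop above, that each chunk is a string.

```python
def print_and_collect(chunks):
    """Print each chunk as it arrives and return the complete response text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


def fake_stream():
    """Hypothetical stand-in for llm.stream(...); yields string chunks."""
    yield from ["Hel", "lo", "!"]


if __name__ == "__main__":
    # With a real model: print_and_collect(llm.stream("[INST] Hello [/INST] "))
    full_text = print_and_collect(fake_stream())
```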
You can also use the asynchronous streaming API, natively implemented underneath.
async def async_streaming():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    async for chunk in llm.astream("[INST] Hello [/INST] "):
        print(chunk, end='', flush=True)
    print()
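Async streaming also lets a single event loop serve several streams at the same time, which is the typical situation in a web server handling multiple users. The sketch below consumes two streams concurrently. `collect`, `fake_astream`, and `main` are hypothetical names; `fake_astream` is an offline stand-in for `llm.astream(...)` and assumes string chunks.

```python
import asyncio


async def collect(stream):
    """Consume an async stream of string chunks and return the joined text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


async def fake_astream(tokens, delay=0.01):
    """Hypothetical stand-in for llm.astream(...); yields tokens with a delay."""
    for token in tokens:
        await asyncio.sleep(delay)
        yield token


async def main():
    # Both streams advance while the other is waiting on its next token.
    first, second = await asyncio.gather(
        collect(fake_astream(["Hi", " there"])),
        collect(fake_astream(["Good", "bye"])),
    )
    return first, second


if __name__ == "__main__":
    print(asyncio.run(main()))
```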
© 2026 DeepInfra. All rights reserved.