Langchain improvements: async and streaming
Published on 2023.10.25 by Iskren Chernev

Starting with langchain v0.0.322, the DeepInfra integration supports efficient async generation and token streaming.

Async generation

The DeepInfra wrapper now supports native async calls, so your async pipelines should perform better: an invocation no longer needs its own thread.

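import asyncio

from langchain.llms.deepinfra import DeepInfra

async def async_predict():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    # Awaiting apredict yields control to the event loop while the request is in flight
    output = await llm.apredict("What is 2 + 2?")
    print(output)

asyncio.run(async_predict())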
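
The practical benefit of native async shows up when several requests are in flight at once. The following is a minimal sketch of that pattern, not part of the DeepInfra wrapper: `fake_apredict` is a hypothetical stand-in that sleeps instead of calling the API, so the example runs offline, and `predict_all` and `timed_run` are illustrative helper names.

```python
import asyncio
import time


async def fake_apredict(prompt: str, delay: float = 0.2) -> str:
    """Hypothetical stand-in for `await llm.apredict(prompt)`.

    It sleeps instead of calling an API, so the sketch runs offline.
    """
    await asyncio.sleep(delay)
    return f"echo: {prompt}"


async def predict_all(prompts: list[str]) -> list[str]:
    # gather() runs all coroutines on the same event loop and returns
    # the results in the same order as the prompts.
    return await asyncio.gather(*(fake_apredict(p) for p in prompts))


def timed_run(prompts: list[str]) -> tuple[list[str], float]:
    # Returns the results together with the wall-clock time in seconds.
    start = time.perf_counter()
    results = asyncio.run(predict_all(prompts))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_run(["What is 2 + 2?", "What is 3 + 3?", "What is 4 + 4?"])
    print(results)
    print(f"{elapsed:.2f}s")
```

Because the simulated calls overlap, the total time stays close to the latency of a single call rather than the sum of all three. With a real model, replacing `fake_apredict(p)` with `llm.apredict(p)` should give the same structure, assuming the API token is configured.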

Response streaming

Streaming lets you receive each token of the response as it is generated. This matters most in user-facing applications, where showing partial output reduces the perceived latency.

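def streaming():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    # Each chunk is a piece of generated text; print it as soon as it arrives
    for chunk in llm.stream("[INST] Hello [/INST] "):
        print(chunk, end='', flush=True)
    print()  # end the line once the stream is exhausted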
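
Applications often need the complete response as well as the live output, for example to store it or pass it to a later step. A minimal sketch of that pattern follows. `fake_stream` is a hypothetical stand-in for `llm.stream(prompt)` that yields word-sized chunks without calling the API, and `print_and_collect` is an illustrative helper name.

```python
from typing import Iterable, Iterator


def fake_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for `llm.stream(prompt)`; yields word-sized chunks."""
    for word in text.split(" "):
        yield word + " "


def print_and_collect(chunks: Iterable[str]) -> str:
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)  # show the chunk immediately
        parts.append(chunk)               # keep it for later use
    print()
    return "".join(parts)


if __name__ == "__main__":
    full_text = print_and_collect(fake_stream("Hello there, how can I help?"))
```

Passing `llm.stream("[INST] Hello [/INST] ")` in place of `fake_stream(...)` should work the same way, since the helper only assumes an iterable of strings.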

An asynchronous streaming API is also available and, like async generation, it is implemented natively.

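async def async_streaming():
    llm = DeepInfra(model_id="meta-llama/Llama-2-7b-chat-hf")
    # async for yields control to the event loop while waiting for the next chunk
    async for chunk in llm.astream("[INST] Hello [/INST] "):
        print(chunk, end='', flush=True)
    print()  # end the line once the stream is exhausted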
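
One way to see what streaming buys you is to compare the time to the first chunk with the time to the full response. The sketch below does this with a hypothetical async generator, `fake_astream`, which simulates per-chunk delays instead of calling `llm.astream`. The helper name `consume` and the delay value are assumptions made for illustration.

```python
import asyncio
import time
from typing import AsyncIterator, Optional


async def fake_astream(text: str, delay: float = 0.05) -> AsyncIterator[str]:
    """Hypothetical stand-in for `llm.astream(prompt)`."""
    for word in text.split(" "):
        await asyncio.sleep(delay)  # simulated generation time per chunk
        yield word + " "


async def consume(chunks: AsyncIterator[str]) -> tuple[str, Optional[float], float]:
    """Return the full text, seconds until the first chunk, and total seconds."""
    start = time.perf_counter()
    first_chunk_at: Optional[float] = None
    parts = []
    async for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), first_chunk_at, total


if __name__ == "__main__":
    text, first, total = asyncio.run(consume(fake_astream("Hello there, how can I help?")))
    print(text)
    print(f"first chunk after {first:.2f}s, full response after {total:.2f}s")
```

The gap between the two timings is the period during which a streaming interface can already show output while a non-streaming one would still be waiting.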