
When large language models move from demos into real systems, expectations change. The goal is no longer to produce clever text, but to deliver predictable latency, responsive behavior, and reliable infrastructure characteristics.
In chat-based systems, especially, how fast a response starts often matters more than how fast it finishes. This is where token streaming becomes critical. This article explains how to build a simple, production-ready streaming chat backend in Python, why streaming fundamentally differs from traditional (non-streaming) responses, how to measure Time To First Token (TTFT), and how to choose an appropriate LLM for this workload.
In a traditional LLM integration, the flow usually looks like this:

1. The client sends a request to the backend.
2. The backend calls the model and waits for the complete response to be generated.
3. Only then is the full text returned and rendered for the user.
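A minimal sketch of this flow with an OpenAI-compatible client (the API key placeholder, base URL, and model mirror the ones used later in this article) looks like this:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

# The call blocks until the model has generated the entire answer.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Explain token streaming."}],
)

# Nothing reaches the user until this point.
print(response.choices[0].message.content)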
This approach is simple, but it has a serious drawback: nothing happens while the model is thinking. If the response takes three seconds to generate, the user sees three seconds of silence. Even if the final answer is excellent, the system feels slow, unresponsive, or broken.
Human perception is highly sensitive to feedback delays. Once a system crosses roughly 300–500 milliseconds without visible output, users begin to lose confidence. This is not a model quality issue—it is an interaction design problem.
Streaming directly addresses this gap.
The key distinction between non-streaming and streaming responses is not the content itself, but when that content becomes visible.
In non-streaming mode, the model generates the entire response internally before returning anything to the client. This approach is simple and predictable, making it well-suited for use cases such as JSON-only APIs, data extraction, and schema-validated workflows where the full output must be available at once. However, because no partial output is sent, users experience the entire generation time as waiting, which can make conversational systems feel slow or unresponsive.
Streaming takes the opposite approach. Instead of waiting for the full response, the model emits tokens as soon as they are generated and delivers them incrementally. While the total generation time is similar, the time to the first visible output is dramatically lower. This creates a much more responsive experience and makes streaming ideal for chat interfaces, assistants, and copilot-style applications.
The response itself does not change. What changes is the timing—and that timing has a significant impact on how the system is perceived.
When working with streaming systems, TTFT becomes one of the most important performance indicators.
TTFT measures the time between the moment a request is sent to the model and the moment the first token of the response reaches the client.

A system with a low TTFT but a longer total generation time often feels significantly faster than a system with a high TTFT that finishes sooner overall.
From a user’s perspective, the system “responds immediately,” even if the full answer takes longer. Because of this, TTFT should be explicitly measured and logged in production systems.
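A minimal sketch of such a measurement, using Python's standard logging module (any metrics or tracing client would work the same way):

import logging
import time

logger = logging.getLogger("chat")

def record_ttft(request_start: float) -> float:
    """Call once, when the first token arrives; returns TTFT in seconds."""
    ttft = time.time() - request_start
    logger.info("ttft_seconds=%.3f", ttft)
    return ttft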
For this article, we use:
deepseek-ai/DeepSeek-V3

DeepSeek-V3 is one of the most compelling modern models for streaming chat backends because it combines high reasoning quality with unusually low inference cost. Unlike many large frontier models, it was explicitly designed with efficiency in mind, making it well-suited for real-time, high-throughput applications.
One of its most important characteristics is fast and consistent token emission. The model starts producing meaningful output quickly, resulting in a low Time To First Token even for non-trivial prompts. This makes it particularly effective in streaming scenarios, where early feedback matters more than absolute completion time.
DeepSeek-V3 also demonstrates strong instruction adherence without verbose prompting. Short, minimal system prompts are sufficient to guide behavior, and the model avoids unnecessary preambles or conversational filler. This reduces token overhead and helps keep streams clean and predictable—an important property when tokens are forwarded directly to clients.
Another advantage is streaming stability. The model produces a smooth, incremental token flow without long initial pauses or erratic bursts. This simplifies stream handling in backend services and improves reliability under load.
Finally, DeepSeek-V3 is cheap relative to its capability and broadly available through OpenAI-compatible APIs, including platforms like DeepInfra. This makes it easy to integrate into existing systems while maintaining cost control at scale.
Streaming performance is heavily influenced by prompt length and structure. Overly verbose system prompts increase:

- prompt processing time, and with it TTFT,
- per-request token cost,
- the risk of preambles and filler appearing in the stream.
A short, role-defining system prompt works best:
You are a backend chat service.
Respond concisely.
Do not add meta commentary.

This prompt defines a clear role, keeps token overhead minimal, and discourages preambles and filler that would otherwise be streamed to the client.
In streaming scenarios, brevity is a performance feature.
The following example demonstrates a minimal streaming chat backend using Python and the DeepInfra OpenAI-compatible API.
import time

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

MODEL = "deepseek-ai/DeepSeek-V3"

SYSTEM_PROMPT = """
You are a backend chat service.
Respond concisely.
Do not add meta commentary.
"""

Streaming Function with TTFT Measurement
def stream_chat(user_input: str):
    start_time = time.time()
    first_token_time = None

    # Request a streaming completion; chunks arrive as they are generated.
    stream = client.chat.completions.create(
        model=MODEL,
        stream=True,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
        temperature=0.4,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta
        if not delta or not delta.content:
            continue

        # Record TTFT exactly once, when the first token arrives.
        if first_token_time is None:
            first_token_time = time.time()
            ttft = first_token_time - start_time
            print(f"\n[TTFT: {ttft:.3f}s]\n")

        yield delta.content

Example Usage
for token in stream_chat("Explain token streaming in one paragraph."):
    print(token, end="", flush=True)

Example Output
[TTFT: 1.078s]
Token streaming is a technique used in natural language processing (NLP) where text is generated and transmitted incrementally, one token (word or subword) at a time, rather than waiting for the entire output to be generated. This approach reduces latency by allowing the user to see partial results immediately, improving the responsiveness of applications like chatbots or real-time translation systems. It is particularly useful in scenarios where quick feedback is essential, enabling a smoother and more interactive user experience.

The important part is not the text itself, but how quickly it starts.
Once a streaming chat function is available at the Python level, the next step is to expose it as an HTTP API that clients can consume in real-time. In practice, FastAPI is a natural choice for this, as it integrates cleanly with Python generators and streaming responses.
The core idea is simple: instead of returning a single response object, the API endpoint returns a stream of events. Each event contains either a token, metadata such as Time To First Token (TTFT), or a signal that the stream has finished. Server-Sent Events (SSE) are commonly used for this purpose because they are lightweight, widely supported, and well-suited for text-based streaming.
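For example, the frames emitted by the endpoint shown below look roughly like this on the wire (values illustrative):

event: ttft
data: {"ttft_seconds": 1.078}

event: token
data: {"token": "Token"}

event: done
data: {"status": "done", "total_seconds": 4.1}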
At the FastAPI level, this is handled using StreamingResponse. The endpoint itself remains thin; it does not manage conversation state or presentation logic. Its sole responsibility is to forward tokens as they are produced by the model.
A minimal endpoint definition looks like this:
import json
import time
from typing import Generator, Optional

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
from openai import OpenAI

# --- LLM client (OpenAI-compatible; e.g., DeepInfra) ---
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

MODEL = "deepseek-ai/DeepSeek-V3"

SYSTEM_PROMPT = (
    "You are a backend chat service.\n"
    "Respond concisely.\n"
    "Do not add meta commentary."
)

# --- FastAPI app ---
app = FastAPI(title="Streaming Chat API")


class ChatRequest(BaseModel):
    """
    Minimal request payload for streaming chat.
    Extend with conversation history, user_id, etc. as needed.
    """
    message: str = Field(..., min_length=1, description="User input message")
    temperature: float = Field(0.4, ge=0.0, le=2.0, description="Sampling temperature")


def _sse(data: dict, event: Optional[str] = None) -> str:
    """
    Format a Server-Sent Event (SSE) message.

    SSE frame format:
        event: <name>   (optional)
        data: <json>    (required)
        \n              (blank line ends the event)

    We use JSON payloads so the frontend can reliably parse events.
    """
    payload = json.dumps(data, ensure_ascii=False)
    if event:
        return f"event: {event}\ndata: {payload}\n\n"
    return f"data: {payload}\n\n"


def _stream_chat_tokens(message: str, temperature: float) -> Generator[str, None, None]:
    """
    Streams tokens from DeepSeek-V3 and yields SSE frames.

    Events emitted:
        - meta:  stream started
        - ttft:  time-to-first-token in seconds
        - token: incremental token chunks
        - done:  stream completed

    Any exceptions are sent as an 'error' event.
    """
    start = time.time()
    first_token_at = None

    # Optional start event: helps clients initialize UI state immediately
    yield _sse({"status": "started"}, event="meta")

    try:
        stream = client.chat.completions.create(
            model=MODEL,
            stream=True,
            temperature=temperature,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": message},
            ],
        )

        for chunk in stream:
            delta = chunk.choices[0].delta
            content = getattr(delta, "content", None)
            if not content:
                continue

            # Emit TTFT once, exactly when the first token arrives
            if first_token_at is None:
                first_token_at = time.time()
                yield _sse({"ttft_seconds": round(first_token_at - start, 3)}, event="ttft")

            # Emit token chunks
            yield _sse({"token": content}, event="token")

        # Completed
        total = time.time() - start
        yield _sse({"status": "done", "total_seconds": round(total, 3)}, event="done")

    except Exception as e:
        # Never crash the stream silently—send a final error event
        yield _sse({"status": "error", "message": str(e)}, event="error")


@app.post("/chat/stream")
def chat_stream(req: ChatRequest) -> StreamingResponse:
    """
    FastAPI endpoint that returns an SSE stream of model tokens.

    Why SSE:
        - works over standard HTTP
        - easy to consume from browsers and many clients
        - ideal for unidirectional token streams

    Response:
        media_type="text/event-stream" keeps the connection open
        and flushes events incrementally.
    """
    generator = _stream_chat_tokens(req.message, req.temperature)
    return StreamingResponse(
        generator,
        media_type="text/event-stream",
        headers={
            # Recommended headers for SSE
            "Cache-Control": "no-cache, no-transform",
            "Connection": "keep-alive",
            # If you're behind Nginx, this helps disable response buffering:
            # "X-Accel-Buffering": "no",
        },
    )

The streaming logic lives inside a generator function that yields chunks as they arrive from the model. This generator measures TTFT once, emits token events incrementally, and finally signals completion.
Each chunk received from the model is forwarded immediately. FastAPI takes care of flushing these chunks to the client over an open HTTP connection. From an architectural standpoint, this keeps the backend stateless, predictable, and easy to scale.
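To try the endpoint locally, a minimal entry point (assuming uvicorn is installed and the code above lives in a file named main.py, a hypothetical name) could look like this:

if __name__ == "__main__":
    import uvicorn

    # Serve the FastAPI app; SSE streaming works over plain HTTP.
    uvicorn.run("main:app", host="0.0.0.0", port=8000)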
With a streaming API in place, the focus shifts to the client side. Whether the frontend is a web app, a desktop application, or an internal tool, its job is to consume the stream and render partial output as it arrives.
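For an internal tool or test harness, a minimal Python consumer of the SSE stream (assuming the requests library and the service above running at http://localhost:8000) might look like this:

import json
import requests

with requests.post(
    "http://localhost:8000/chat/stream",
    json={"message": "Explain token streaming in one paragraph."},
    stream=True,
) as resp:
    event = None
    for line in resp.iter_lines(decode_unicode=True):
        if not line:
            event = None  # a blank line terminates the current SSE frame
            continue
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data = json.loads(line.split(":", 1)[1].strip())
            if event == "ttft":
                print(f"\n[TTFT: {data['ttft_seconds']}s]\n")
            elif event == "token":
                print(data["token"], end="", flush=True)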
In a typical JavaScript frontend, the client opens a streaming connection to the FastAPI endpoint and listens for incoming events. Each token is appended to the currently visible assistant message, creating the familiar “typing” effect. Because the backend already emits TTFT and completion signals, the frontend can react intelligently—showing loading indicators, measuring responsiveness, or enabling a “stop generating” action.
This is also where modern workflows increasingly use LLMs themselves. Teams often rely on language models to help design chat layouts, suggest interaction patterns, and iterate on microcopy or UX details. Because the backend exposes a clean, streaming-based interface, frontend experimentation can happen rapidly without requiring backend changes.
As the system matures, additional layers are typically added: conversation history and session state, authentication and rate limiting, and observability such as logging TTFT and throughput per request.
Crucially, none of these require changes to the core streaming mechanism. The streaming API becomes a stable foundation on which different interfaces can evolve.
Streaming is not a cosmetic feature—it is a structural choice that shapes how LLM-powered systems behave in production. By delivering tokens as soon as they are generated, streaming dramatically improves perceived latency, enables early cancellation, and creates a natural interface for interactive applications.
Using a modern, cost-efficient model like DeepSeek-V3, combined with a lightweight system prompt and explicit TTFT measurement, makes it possible to build a responsive chat backend with minimal complexity. FastAPI provides a clean way to expose this functionality over HTTP, while frontend clients—often designed with the help of LLMs themselves—can focus entirely on user experience.
If large language models are part of your infrastructure, streaming should be the default for any interactive use case. Everything else builds on that foundation.