
When large language models move from demos into real systems, expectations change. The goal is no longer to produce clever text, but to deliver predictable latency, responsive behavior, and reliable infrastructure characteristics.
In chat-based systems, especially, how fast a response starts often matters more than how fast it finishes. This is where token streaming becomes critical. This article explains how to build a simple, production-ready streaming chat backend in Python, why streaming fundamentally differs from traditional (non-streaming) responses, how to measure Time To First Token (TTFT), and how to choose an appropriate LLM for this workload.
In a traditional LLM integration, the flow usually looks like this:

1. The client sends a request to the backend.
2. The backend calls the model and waits for the complete response to be generated.
3. Only then is the full text returned and rendered for the user.
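A minimal sketch of this flow with an OpenAI-compatible client (the API key placeholder, base URL, and model mirror the ones used later in this article) looks like this:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

# The call blocks until the model has generated the entire answer.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Explain token streaming."}],
)

# Nothing reaches the user until this point.
print(response.choices[0].message.content)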
This approach is simple, but it has a serious drawback: nothing happens while the model is thinking. If the response takes three seconds to generate, the user sees three seconds of silence. Even if the final answer is excellent, the system feels slow, unresponsive, or broken.
Human perception is highly sensitive to feedback delays. Once a system crosses roughly 300–500 milliseconds without visible output, users begin to lose confidence. This is not a model quality issue—it is an interaction design problem.
Streaming directly addresses this gap.
The key distinction between non-streaming and streaming responses is not the content itself, but when that content becomes visible.
In non-streaming mode, the model generates the entire response internally before returning anything to the client. This approach is simple and predictable, making it well-suited for use cases such as JSON-only APIs, data extraction, and schema-validated workflows where the full output must be available at once. However, because no partial output is sent, users experience the entire generation time as waiting, which can make conversational systems feel slow or unresponsive.
Streaming takes the opposite approach. Instead of waiting for the full response, the model emits tokens as soon as they are generated and delivers them incrementally. While the total generation time is similar, the time to the first visible output is dramatically lower. This creates a much more responsive experience and makes streaming ideal for chat interfaces, assistants, and copilot-style applications.
The response itself does not change. What changes is the timing—and that timing has a significant impact on how the system is perceived.
When working with streaming systems, TTFT becomes one of the most important performance indicators.
TTFT measures the time between the moment a request is sent to the model and the moment the first token of the response reaches the client.

A system with a low TTFT but a longer total generation time often feels significantly faster than a system with a high TTFT that finishes sooner overall.
From a user’s perspective, the system “responds immediately,” even if the full answer takes longer. Because of this, TTFT should be explicitly measured and logged in production systems.
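A minimal sketch of such a measurement, using Python's standard logging module (any metrics or tracing client would work the same way):

import logging
import time

logger = logging.getLogger("chat")

def record_ttft(request_start: float) -> float:
    """Call once, when the first token arrives; returns TTFT in seconds."""
    ttft = time.time() - request_start
    logger.info("ttft_seconds=%.3f", ttft)
    return ttft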
For this article, we use:
deepseek-ai/DeepSeek-V3

DeepSeek-V3 is one of the most compelling modern models for streaming chat backends because it combines high reasoning quality with unusually low inference cost. Unlike many large frontier models, it was explicitly designed with efficiency in mind, making it well-suited for real-time, high-throughput applications.
One of its most important characteristics is fast and consistent token emission. The model starts producing meaningful output quickly, resulting in a low Time To First Token even for non-trivial prompts. This makes it particularly effective in streaming scenarios, where early feedback matters more than absolute completion time.
DeepSeek-V3 also demonstrates strong instruction adherence without verbose prompting. Short, minimal system prompts are sufficient to guide behavior, and the model avoids unnecessary preambles or conversational filler. This reduces token overhead and helps keep streams clean and predictable—an important property when tokens are forwarded directly to clients.
Another advantage is streaming stability. The model produces a smooth, incremental token flow without long initial pauses or erratic bursts. This simplifies stream handling in backend services and improves reliability under load.
Finally, DeepSeek-V3 is cheap relative to its capability and broadly available through OpenAI-compatible APIs, including platforms like DeepInfra. This makes it easy to integrate into existing systems while maintaining cost control at scale.
Streaming performance is heavily influenced by prompt length and structure. Overly verbose system prompts increase:

- prompt processing time, and with it TTFT,
- per-request token cost,
- the risk of preambles and filler appearing in the stream.
A short, role-defining system prompt works best:
You are a backend chat service.
Respond concisely.
Do not add meta commentary.

This prompt defines a clear role, keeps token overhead minimal, and discourages preambles and filler that would otherwise be streamed to the client.
In streaming scenarios, brevity is a performance feature.
The following example demonstrates a minimal streaming chat backend using Python and the DeepInfra OpenAI-compatible API.
import time

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

MODEL = "deepseek-ai/DeepSeek-V3"

SYSTEM_PROMPT = """
You are a backend chat service.
Respond concisely.
Do not add meta commentary.
"""

Streaming Function with TTFT Measurement
def stream_chat(user_input: str):
    start_time = time.time()
    first_token_time = None

    # Request a streaming completion; chunks arrive as they are generated.
    stream = client.chat.completions.create(
        model=MODEL,
        stream=True,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_input},
        ],
        temperature=0.4,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta
        if not delta or not delta.content:
            continue

        # Record TTFT exactly once, when the first token arrives.
        if first_token_time is None:
            first_token_time = time.time()
            ttft = first_token_time - start_time
            print(f"\n[TTFT: {ttft:.3f}s]\n")

        yield delta.content

Example Usage
for token in stream_chat("Explain token streaming in one paragraph."):
    print(token, end="", flush=True)

Example Output
[TTFT: 1.078s]
Token streaming is a technique used in natural language processing (NLP) where text is generated and transmitted incrementally, one token (word or subword) at a time, rather than waiting for the entire output to be generated. This approach reduces latency by allowing the user to see partial results immediately, improving the responsiveness of applications like chatbots or real-time translation systems. It is particularly useful in scenarios where quick feedback is essential, enabling a smoother and more interactive user experience.

The important part is not the text itself, but how quickly it starts.
Once a streaming chat function is available at the Python level, the next step is to expose it as an HTTP API that clients can consume in real-time. In practice, FastAPI is a natural choice for this, as it integrates cleanly with Python generators and streaming responses.
The core idea is simple: instead of returning a single response object, the API endpoint returns a stream of events. Each event contains either a token, metadata such as Time To First Token (TTFT), or a signal that the stream has finished. Server-Sent Events (SSE) are commonly used for this purpose because they are lightweight, widely supported, and well-suited for text-based streaming.
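For example, the frames emitted by the endpoint shown below look roughly like this on the wire (values illustrative):

event: ttft
data: {"ttft_seconds": 1.078}

event: token
data: {"token": "Token"}

event: done
data: {"status": "done", "total_seconds": 4.1}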
At the FastAPI level, this is handled using StreamingResponse. The endpoint itself remains thin; it does not manage conversation state or presentation logic. Its sole responsibility is to forward tokens as they are produced by the model.
A minimal endpoint definition looks like this:
import json
import time
from typing import Generator, Optional

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
from openai import OpenAI

# --- LLM client (OpenAI-compatible; e.g., DeepInfra) ---
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.deepinfra.com/v1/openai",
)

MODEL = "deepseek-ai/DeepSeek-V3"

SYSTEM_PROMPT = (
    "You are a backend chat service.\n"
    "Respond concisely.\n"
    "Do not add meta commentary."
)

# --- FastAPI app ---
app = FastAPI(title="Streaming Chat API")


class ChatRequest(BaseModel):
    """
    Minimal request payload for streaming chat.
    Extend with conversation history, user_id, etc. as needed.
    """
    message: str = Field(..., min_length=1, description="User input message")
    temperature: float = Field(0.4, ge=0.0, le=2.0, description="Sampling temperature")


def _sse(data: dict, event: Optional[str] = None) -> str:
    """
    Format a Server-Sent Event (SSE) message.

    SSE frame format:
        event: <name>   (optional)
        data: <json>    (required)
        \n              (blank line ends the event)

    We use JSON payloads so the frontend can reliably parse events.
    """
    payload = json.dumps(data, ensure_ascii=False)
    if event:
        return f"event: {event}\ndata: {payload}\n\n"
    return f"data: {payload}\n\n"


def _stream_chat_tokens(message: str, temperature: float) -> Generator[str, None, None]:
    """
    Streams tokens from DeepSeek-V3 and yields SSE frames.

    Events emitted:
        - meta:  stream started
        - ttft:  time-to-first-token in seconds
        - token: incremental token chunks
        - done:  stream completed

    Any exceptions are sent as an 'error' event.
    """
    start = time.time()
    first_token_at = None

    # Optional start event: helps clients initialize UI state immediately
    yield _sse({"status": "started"}, event="meta")

    try:
        stream = client.chat.completions.create(
            model=MODEL,
            stream=True,
            temperature=temperature,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": message},
            ],
        )

        for chunk in stream:
            delta = chunk.choices[0].delta
            content = getattr(delta, "content", None)
            if not content:
                continue

            # Emit TTFT once, exactly when the first token arrives
            if first_token_at is None:
                first_token_at = time.time()
                yield _sse({"ttft_seconds": round(first_token_at - start, 3)}, event="ttft")

            # Emit token chunks
            yield _sse({"token": content}, event="token")

        # Completed
        total = time.time() - start
        yield _sse({"status": "done", "total_seconds": round(total, 3)}, event="done")

    except Exception as e:
        # Never crash the stream silently—send a final error event
        yield _sse({"status": "error", "message": str(e)}, event="error")


@app.post("/chat/stream")
def chat_stream(req: ChatRequest) -> StreamingResponse:
    """
    FastAPI endpoint that returns an SSE stream of model tokens.

    Why SSE:
        - works over standard HTTP
        - easy to consume from browsers and many clients
        - ideal for unidirectional token streams

    Response:
        media_type="text/event-stream" keeps the connection open
        and flushes events incrementally.
    """
    generator = _stream_chat_tokens(req.message, req.temperature)
    return StreamingResponse(
        generator,
        media_type="text/event-stream",
        headers={
            # Recommended headers for SSE
            "Cache-Control": "no-cache, no-transform",
            "Connection": "keep-alive",
            # If you're behind Nginx, this helps disable response buffering:
            # "X-Accel-Buffering": "no",
        },
    )

The streaming logic lives inside a generator function that yields chunks as they arrive from the model. This generator measures TTFT once, emits token events incrementally, and finally signals completion.
Each chunk received from the model is forwarded immediately. FastAPI takes care of flushing these chunks to the client over an open HTTP connection. From an architectural standpoint, this keeps the backend stateless, predictable, and easy to scale.
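To try the endpoint locally, a minimal entry point (assuming uvicorn is installed and the code above lives in a file named main.py, a hypothetical name) could look like this:

if __name__ == "__main__":
    import uvicorn

    # Serve the FastAPI app; SSE streaming works over plain HTTP.
    uvicorn.run("main:app", host="0.0.0.0", port=8000)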
With a streaming API in place, the focus shifts to the client side. Whether the frontend is a web app, a desktop application, or an internal tool, its job is to consume the stream and render partial output as it arrives.
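For an internal tool or test harness, a minimal Python consumer of the SSE stream (assuming the requests library and the service above running at http://localhost:8000) might look like this:

import json
import requests

with requests.post(
    "http://localhost:8000/chat/stream",
    json={"message": "Explain token streaming in one paragraph."},
    stream=True,
) as resp:
    event = None
    for line in resp.iter_lines(decode_unicode=True):
        if not line:
            event = None  # a blank line terminates the current SSE frame
            continue
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            data = json.loads(line.split(":", 1)[1].strip())
            if event == "ttft":
                print(f"\n[TTFT: {data['ttft_seconds']}s]\n")
            elif event == "token":
                print(data["token"], end="", flush=True)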
In a typical JavaScript frontend, the client opens a streaming connection to the FastAPI endpoint and listens for incoming events. Each token is appended to the currently visible assistant message, creating the familiar “typing” effect. Because the backend already emits TTFT and completion signals, the frontend can react intelligently—showing loading indicators, measuring responsiveness, or enabling a “stop generating” action.
This is also where modern workflows increasingly use LLMs themselves. Teams often rely on language models to help design chat layouts, suggest interaction patterns, and iterate on microcopy or UX details. Because the backend exposes a clean, streaming-based interface, frontend experimentation can happen rapidly without requiring backend changes.
As the system matures, additional layers are typically added: conversation history and session state, authentication and rate limiting, and observability such as logging TTFT and throughput per request.
Crucially, none of these require changes to the core streaming mechanism. The streaming API becomes a stable foundation on which different interfaces can evolve.
Streaming is not a cosmetic feature—it is a structural choice that shapes how LLM-powered systems behave in production. By delivering tokens as soon as they are generated, streaming dramatically improves perceived latency, enables early cancellation, and creates a natural interface for interactive applications.
Using a modern, cost-efficient model like DeepSeek-V3, combined with a lightweight system prompt and explicit TTFT measurement, makes it possible to build a responsive chat backend with minimal complexity. FastAPI provides a clean way to expose this functionality over HTTP, while frontend clients—often designed with the help of LLMs themselves—can focus entirely on user experience.
If large language models are part of your infrastructure, streaming should be the default for any interactive use case. Everything else builds on that foundation.