Streaming
Streaming in AI inference is the delivery of model output token by token to the client as each token is generated, rather than waiting for the full response before transmitting anything. It reduces perceived latency to roughly the time-to-first-token and enables progressive rendering of long responses.
Streaming sends each generated token, or a small buffer of tokens, to the client immediately after it is produced, over a persistent connection that remains open for the duration of generation. The two most common transport mechanisms are Server-Sent Events (SSE) over HTTP, where the server pushes a sequence of `data:`-prefixed events (usually carrying JSON payloads) separated by blank lines, and gRPC streaming RPCs. The client reads arriving chunks and appends them to the display in real time, producing the typewriter-style output familiar from ChatGPT, Claude, and Gemini interfaces.
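As a rough illustration of the SSE side of this, the sketch below wraps each token in one event, written as a `data:` line followed by a blank line. The function name `sse_frames` and the `{"text": ...}` payload shape are assumptions made for the example, not any provider's actual format.

```python
import json
from typing import Iterable, Iterator


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in one SSE event: a `data:` line plus a blank line."""
    for token in tokens:
        # Hypothetical payload shape, chosen only for illustration.
        payload = json.dumps({"text": token})
        yield f"data: {payload}\n\n"


# Each frame would be written to the open connection as soon as it is produced.
frames = list(sse_frames(["Hel", "lo"]))
```

A real server would write each frame to the socket and flush it immediately instead of collecting the frames in a list.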
From the model's perspective, computation is identical whether streaming is enabled or not: the autoregressive decoder produces one token per forward pass regardless. The difference is purely in delivery — without streaming, the server buffers all tokens and flushes them in a single HTTP response body after generation completes; with streaming, each token or micro-batch triggers a write to the open socket. This imposes negligible additional server overhead while fundamentally changing the user's experience of latency.
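The difference in delivery can be sketched with a toy generator standing in for the decoder. Everything here (`generate_tokens`, `deliver_buffered`, `deliver_streamed`, and the `write` callback standing in for a socket write) is hypothetical and exists only to show that the same tokens are produced either way.

```python
from typing import Callable, Iterator, List


def generate_tokens() -> Iterator[str]:
    """Stand-in for an autoregressive decoder: yields one token per step."""
    for token in ["Stream", "ing", " works", "."]:
        yield token


def deliver_buffered(write: Callable[[str], None]) -> None:
    """Non-streaming: collect every token, then write once at the end."""
    write("".join(generate_tokens()))


def deliver_streamed(write: Callable[[str], None]) -> None:
    """Streaming: write each token as soon as it is generated."""
    for token in generate_tokens():
        write(token)


buffered_writes: List[str] = []
streamed_writes: List[str] = []
deliver_buffered(buffered_writes.append)
deliver_streamed(streamed_writes.append)
```

Both paths run the same generation loop. Only the number and timing of writes differ: one write in the buffered case, one per token in the streamed case.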
For responses of moderate to long length, non-streaming delivery requires the user to wait the full generation time, potentially 10–30 seconds for multi-paragraph outputs, before seeing anything. Streaming reduces the subjective wait to the time-to-first-token, typically under one second on optimized systems. It also enables early stopping: a user can interrupt generation once they have enough information, which saves the compute that would otherwise be spent completing an unwanted response, provided the server cancels generation when the client disconnects. In voice pipelines and agent loops, streaming is architecturally essential: text-to-speech synthesis can begin consuming the first sentence while the model is still generating later paragraphs, shaving seconds off voice response latency.
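For the voice-pipeline case, one possible approach is to group the token stream into sentences and hand each sentence to the speech synthesizer as soon as it completes. The sketch below assumes a deliberately naive rule (a sentence ends at `.`, `!`, or `?`), and `sentences_from_tokens` is a hypothetical helper rather than part of any TTS library.

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Naive boundary check; abbreviations and decimals would split early.
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush any trailing text that never reached a sentence boundary.
        yield buffer.strip()


stream = ["Hi", " there", ".", " How", " are", " you", "?", " Fine"]
spoken = list(sentences_from_tokens(stream))
```

In a real pipeline each yielded sentence would be passed to the synthesizer immediately, so speech for the first sentence can start while later tokens are still arriving.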
Streaming is supported by all major LLM APIs as of 2026, including those from OpenAI, Anthropic, Google, Mistral, and Cohere, typically as an opt-in request option such as `stream: true`. OpenAI's SSE chunk format, in which each event carries `data: {"choices":[{"delta":{"content":"token"}}]}` and the stream ends with `data: [DONE]`, has become a de facto standard adopted by vLLM, Ollama, LiteLLM, and many other compatible open-source servers, simplifying client integration across providers.
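A client for this chunk format can be sketched in a few lines. The example below assumes the lines have already been read from the response body and handles only the `choices[].delta.content` field and the `[DONE]` sentinel quoted above. `collect_stream_text` is a hypothetical helper, not part of any SDK.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Concatenate delta content from OpenAI-style SSE lines until [DONE]."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]
text = collect_stream_text(sample)
```

An interactive client would render each `content` fragment as it arrives instead of joining them at the end. The accumulation here just keeps the example easy to check.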