Latency
Latency in AI inference is the elapsed time between submitting a request to a model and receiving its response, typically measured in milliseconds. For large language models it is usually split into time-to-first-token (TTFT) and time-per-output-token (TPOT), the latter also called inter-token latency.
Latency measures the delay accumulated across every stage of the inference pipeline, from the moment a request is sent to the moment the full response is received. Two sub-metrics matter most for LLMs: time-to-first-token (TTFT), the delay before the first output token appears, and time-per-output-token (TPOT), the average interval between subsequent tokens. End-to-end latency is approximately TTFT plus (TPOT × number of output tokens); strictly, the first token is already covered by TTFT, but the difference is negligible for long outputs.
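The relationship between the three quantities can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the names `LatencyBreakdown`, `summarize_latency`, and `estimate_total_latency` are hypothetical, and the timestamps are assumed to be client-side arrival times in seconds.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyBreakdown:
    ttft_s: float   # time to first token, in seconds
    tpot_s: float   # mean time per output token after the first, in seconds
    total_s: float  # end-to-end latency, in seconds


def summarize_latency(request_sent_s: float,
                      token_times_s: Sequence[float]) -> LatencyBreakdown:
    """Derive TTFT, mean TPOT, and end-to-end latency from token arrival times."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft = token_times_s[0] - request_sent_s
    total = token_times_s[-1] - request_sent_s
    gaps = len(token_times_s) - 1  # intervals between consecutive tokens
    tpot = (total - ttft) / gaps if gaps > 0 else 0.0
    return LatencyBreakdown(ttft, tpot, total)


def estimate_total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate end-to-end latency; the first token is covered by TTFT."""
    return ttft_s + tpot_s * max(output_tokens - 1, 0)
```

For example, a request sent at t = 0 whose four tokens arrive at 0.50, 0.52, 0.54, and 0.56 seconds has a TTFT of 0.5 s, a mean TPOT of about 20 ms, and an end-to-end latency of 0.56 s.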
Latency accumulates across network transit, request queuing, KV-cache lookup, and GPU computation. The prefill phase, which processes the full input prompt in parallel, dominates TTFT and scales with prompt length. The autoregressive decode phase, which generates one token per forward pass, determines TPOT. During decoding, GPU memory bandwidth rather than raw compute is typically the binding constraint, because the model's weight matrices must be read from high-bandwidth memory (HBM) at every token step.
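The bandwidth argument gives a rough lower bound on TPOT: if every weight must be read once per decode step, a step cannot finish faster than the weight size divided by the memory bandwidth. The sketch below assumes a single accelerator, a batch size of one, and no contribution from KV-cache reads or compute. The function name and the example figures (a 7-billion-parameter model and 1000 GB/s of bandwidth) are illustrative assumptions, not measurements of any particular system.

```python
def decode_tpot_floor_ms(num_params: float,
                         bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode time when weight reads dominate.

    Assumes all weights are read from memory once per token step and
    ignores KV-cache traffic, compute time, and kernel launch overhead.
    """
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth must be positive")
    weight_bytes = num_params * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_gb_s * 1e9
    return weight_bytes / bandwidth_bytes_per_s * 1e3  # seconds -> milliseconds


# Hypothetical 7B-parameter model on a device with 1000 GB/s of bandwidth.
fp16_floor_ms = decode_tpot_floor_ms(7e9, 2.0, 1000.0)  # 16-bit weights
int8_floor_ms = decode_tpot_floor_ms(7e9, 1.0, 1000.0)  # 8-bit weights
```

Under these assumptions the floor is about 14 ms per token at 16-bit precision and about 7 ms at 8-bit precision, which is one way to see why reducing bytes per weight lowers decode latency.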
For interactive applications such as chatbots, coding assistants, and voice agents, high latency directly degrades usability. Human-factors research places the threshold for a response feeling "instant" at roughly 100 to 200 ms; above 1–2 seconds, user engagement and task-completion rates fall measurably. In agentic workflows where a model invokes tools in loops, latency compounds across many sequential calls, so each step's delay is multiplied by the number of steps.
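The compounding effect in agentic loops is simple arithmetic, sketched below under the assumption that steps run strictly one after another and that every step has the same cost. The function name, the `tool_s` parameter (time spent executing a tool between model calls), and the example numbers are all hypothetical.

```python
def agent_loop_latency_s(steps: int,
                         ttft_s: float,
                         tpot_s: float,
                         tokens_per_step: int,
                         tool_s: float = 0.0) -> float:
    """Total wall-clock time for a sequential loop of identical model calls.

    Each step pays TTFT, then TPOT for every generated token, then an
    optional tool execution time before the next call can start.
    """
    per_step_s = ttft_s + tpot_s * tokens_per_step + tool_s
    return steps * per_step_s


# Ten sequential steps, 0.4 s TTFT, 20 ms per token, 100 tokens per step,
# and 0.5 s of tool execution per step (illustrative values only).
total_s = agent_loop_latency_s(10, 0.4, 0.02, 100, tool_s=0.5)
```

With these illustrative values each step takes 2.9 seconds and the loop takes about 29 seconds, even though no single model call feels slow in isolation.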
As of 2026, frontier hosted APIs from OpenAI, Anthropic, and Google typically deliver TTFTs under 500 ms and streaming speeds of 40–100 tokens per second on standard requests. Optimization techniques including speculative decoding (using a small draft model to propose tokens verified by a larger model), continuous batching, and quantization have reduced latency substantially since 2023. Specialized hardware — NVIDIA H100/H200, AMD MI300X, Google TPU v5e — provides the memory bandwidth needed to push TPOT below 10 ms per token.
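The benefit of speculative decoding can be estimated with a simplified model, assuming each drafted token is accepted independently with the same probability. Under that assumption, a verification step with `draft_len` drafted tokens yields (1 - a^(draft_len + 1)) / (1 - a) tokens on average, where `a` is the acceptance rate. The function names and the `draft_cost_ratio` parameter (cost of one draft-model pass relative to one pass of the larger model) are hypothetical, and real acceptance rates vary by prompt and model pair.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per verification step of the larger model.

    Assumes each drafted token is accepted independently with probability
    accept_rate; the verifier always contributes one token of its own.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)  # limit of the geometric sum
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def speculative_speedup(accept_rate: float,
                        draft_len: int,
                        draft_cost_ratio: float) -> float:
    """Estimated speedup over plain decoding, in units of large-model passes."""
    # One step costs draft_len draft passes plus one pass of the larger model.
    step_cost = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost
```

For instance, an acceptance rate of 0.8 with four drafted tokens and a draft model costing 5% of the larger model gives roughly 3.36 tokens per step and an estimated speedup of about 2.8×; with an acceptance rate of zero the same settings are slower than plain decoding, since the draft passes are wasted.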