Inference

Latency

Latency in AI inference is the elapsed time between submitting a request to a model and receiving its response, typically measured in milliseconds. In large language models it is subdivided into time-to-first-token (TTFT) and time-per-output-token (TPOT, also called inter-token latency).

Latency measures the delay accumulated across every stage of the inference pipeline from the moment a request is sent to the moment a response is received. Two sub-metrics matter most for LLMs: time-to-first-token (TTFT), the delay before the first output token appears, and time-per-output-token (TPOT), the pace at which subsequent tokens arrive. End-to-end latency equals roughly TTFT plus (TPOT × number of output tokens).
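The relationship between TTFT, TPOT, and end-to-end latency can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the function name and the sample numbers are assumptions chosen for the example, and it uses the slightly tighter form in which the first token is already covered by TTFT, so only the remaining tokens each add one TPOT interval.

```python
def end_to_end_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Estimate end-to-end latency for a streamed response.

    The first token is already accounted for by TTFT, so only the
    remaining (output_tokens - 1) tokens each add one TPOT interval.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_ms + tpot_ms * (output_tokens - 1)


# Illustrative numbers only: 400 ms TTFT, 20 ms per token, 250 output tokens.
estimate = end_to_end_latency_ms(ttft_ms=400.0, tpot_ms=20.0, output_tokens=250)
print(f"{estimate:.0f} ms")  # 400 + 20 * 249 = 5380 ms
```

For long responses the TPOT term dominates, which is why the two metrics are usually tracked separately.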

Latency accumulates across network transit, request queuing, KV-cache lookup, and GPU computation. The prefill phase, which processes the full input prompt in parallel, dominates TTFT and scales with prompt length. The autoregressive decode phase, which generates one token per forward pass, determines TPOT. During decoding, GPU memory bandwidth rather than raw compute is typically the binding constraint, because the model's weight matrices must be read from high-bandwidth memory (HBM) at every token step.
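The memory-bandwidth argument gives a rough floor on TPOT: if every weight must be read once per token, the per-token time cannot be lower than the weight size divided by the memory bandwidth. The sketch below works through that arithmetic. The function name, the model size, and the bandwidth figure are hypothetical values chosen for illustration, and the estimate assumes batch size 1 on a single device while ignoring KV-cache reads and kernel overhead.

```python
def decode_tpot_floor_ms(param_count: float, bytes_per_param: float,
                         mem_bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode time when weight loading dominates.

    Assumes every weight is read from memory once per token at batch size 1,
    and ignores KV-cache reads, kernel overhead, and multi-GPU communication.
    """
    weight_bytes = param_count * bytes_per_param
    bandwidth_bytes_per_s = mem_bandwidth_gb_s * 1e9
    return weight_bytes / bandwidth_bytes_per_s * 1e3


# Hypothetical 8-billion-parameter model on an accelerator with 2,000 GB/s.
fp16 = decode_tpot_floor_ms(8e9, 2.0, 2000.0)  # 16 GB of weights -> about 8 ms
int4 = decode_tpot_floor_ms(8e9, 0.5, 2000.0)  # 4 GB of weights -> about 2 ms
print(f"FP16 floor: {fp16:.1f} ms/token, INT4 floor: {int4:.1f} ms/token")
```

Under these assumptions, shrinking the weights by a factor of four lowers the floor by the same factor, which is one reason quantization helps decode latency.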

For interactive applications such as chatbots, coding assistants, and voice agents, high latency directly degrades usability. Human-factors research places the threshold for feeling "instant" at roughly 200 ms; above 1–2 seconds, user engagement and task-completion rates fall measurably. In agentic workflows where a model invokes tools in loops, latency compounds across many sequential calls, making each step's delay consequential.

As of 2026, frontier hosted APIs from OpenAI, Anthropic, and Google typically deliver TTFTs under 500 ms and streaming speeds of 40–100 tokens per second on standard requests. Optimization techniques including speculative decoding (using a small draft model to propose tokens verified by a larger model), continuous batching, and quantization have reduced latency substantially since 2023. Specialized hardware — NVIDIA H100/H200, AMD MI300X, Google TPU v5e — provides the memory bandwidth needed to push TPOT below 10 ms per token.
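The draft-and-verify idea behind speculative decoding can be shown with a toy greedy version. This is a simplified sketch, not how any particular serving stack implements it: the two "models" are hypothetical arithmetic functions over integer tokens, verification is written as a loop where a real system would use one batched forward pass of the larger model, and the bonus token that real implementations emit after a fully accepted proposal is omitted.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft proposes k tokens, target verifies."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. Draft model proposes up to k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. Target checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # First mismatch: keep the accepted prefix plus the target's token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every draft token was accepted
    return out


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


# Hypothetical toy models: the draft agrees with the target most of the time.
def target_model(seq: List[int]) -> int:
    return (seq[-1] * 3 + len(seq)) % 11


def draft_model(seq: List[int]) -> int:
    guess = target_model(seq)
    return (guess + 1) % 11 if len(seq) % 5 == 0 else guess


spec = speculative_decode(target_model, draft_model, [1, 2], max_new=12)
base = greedy_decode(target_model, [1, 2], max_new=12)
print(spec == base)
```

Because every kept token is either confirmed or supplied by the target, the output matches plain greedy decoding with the larger model; the latency benefit comes from verifying several positions per target pass instead of one.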

Example

An enterprise deploying a real-time coding assistant monitors TTFT to ensure developers see the first token of a suggestion within 300 ms; if TTFT exceeds this threshold under load, the team scales up replicas or enables speculative decoding to meet the SLA.
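A monitoring check like the one in this example could be sketched as follows. The code is a minimal illustration under stated assumptions: `measure_stream` and `fake_stream` are hypothetical helpers, the fake stream stands in for a real streaming client by sleeping, and the 300 ms threshold is the figure from the example.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_ms, mean_tpot_ms, token_count) for one streamed response."""
    start = time.perf_counter()
    first = last = None
    count = 0
    for _ in tokens:
        now = time.perf_counter()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft_ms = (first - start) * 1e3
    # Mean gap between consecutive tokens after the first one.
    tpot_ms = (last - first) * 1e3 / (count - 1) if count > 1 else 0.0
    return ttft_ms, tpot_ms, count


def fake_stream(ttft_s: float, tpot_s: float, n: int) -> Iterator[str]:
    """Stand-in for a real streaming client; sleeps to simulate model delay."""
    time.sleep(ttft_s)
    yield "tok0"
    for i in range(1, n):
        time.sleep(tpot_s)
        yield f"tok{i}"


TTFT_SLO_MS = 300.0  # threshold taken from the example

ttft, tpot, n = measure_stream(fake_stream(0.05, 0.01, 20))
status = "ok" if ttft <= TTFT_SLO_MS else "breach: scale out or enable speculative decoding"
print(f"TTFT {ttft:.0f} ms, TPOT {tpot:.1f} ms over {n} tokens -> {status}")
```

In practice a team would aggregate these measurements over many requests and alert on a high percentile rather than on a single response.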

Latest news on this topic

Loka built a voice agent on Amazon Nova 2 Sonic with sub-second latency (2026-06-28)
Alibaba releases a translator with 2.8-second latency across 60 languages (2026-05-21)
NVIDIA Vera Rubin: How Developers Will Scale Agentic AI Without Latency (2026-05-21)
OpenAI explained how it rebuilt WebRTC for low-latency voice AI (2026-05-16)
Why latency determines AI system architecture more than model accuracy (2026-05-02)
