Throughput

Throughput in AI inference is the volume of work a model serving system processes per unit of time, commonly expressed as output tokens per second or completed requests per second across all concurrent users. It reflects total system capacity rather than any single request's speed.

Throughput quantifies the aggregate productive output rate of an inference deployment — how many tokens are generated or how many requests are completed across all concurrent sessions in a given time window. It is the system-level counterpart to latency: while latency describes one user's experience, throughput describes the system's overall processing capacity. The two metrics are linked but trade against each other; increasing batch size raises throughput while increasing per-request latency.
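
The batch-size trade-off can be illustrated with a toy cost model. The sketch below assumes each decode step pays a fixed cost plus a small cost per sequence in the batch; the function names and the constants `fixed_s` and `per_seq_s` are hypothetical values chosen for illustration, not measurements from any real system.

```python
def decode_step_seconds(batch_size, fixed_s=0.020, per_seq_s=0.0005):
    """Toy model: one decode step costs a fixed amount plus a per-sequence amount."""
    return fixed_s + per_seq_s * batch_size


def throughput_tokens_per_s(batch_size):
    """Each sequence in the batch emits one token per decode step."""
    return batch_size / decode_step_seconds(batch_size)


for batch_size in (1, 8, 32, 128):
    step_ms = decode_step_seconds(batch_size) * 1000
    tok_s = throughput_tokens_per_s(batch_size)
    # step_ms is the time each user waits per token; tok_s is the system-wide rate.
    print(f"batch={batch_size:4d}  per-token latency={step_ms:6.1f} ms  throughput={tok_s:8.1f} tok/s")
```

Under these assumptions, larger batches raise the aggregate token rate while each individual request waits longer for its next token, which is the trade-off described above.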

Throughput scales with GPU count, batch size, and model optimizations. Continuous batching processes tokens from multiple in-flight requests in a single forward pass, raising GPU utilization. Tensor parallelism shards model weights across multiple GPUs, which frees memory on each device and enables larger batches than a single GPU's memory allows. Quantization, which reduces weight precision from FP16 to INT8 or INT4, shrinks the memory footprint of the weights and leaves room for more concurrent sequences. Pipeline parallelism across nodes further extends capacity for very large models.
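
The effect of quantization on concurrency can be estimated with simple memory arithmetic. The following is a rough sketch that counts only weight memory and a per-sequence KV-cache budget, ignoring activations and runtime overhead. The 13-billion-parameter model, the 80 GB device, and the 1 GB KV-cache budget per sequence are assumed values for illustration only.

```python
GB = 10**9

# Storage cost per weight at each precision, in bytes.
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def max_concurrent_sequences(n_params, precision, gpu_mem_bytes, kv_bytes_per_seq):
    """Estimate how many sequences fit in the memory left after loading the weights."""
    weight_bytes = n_params * BYTES_PER_WEIGHT[precision]
    free_bytes = gpu_mem_bytes - weight_bytes
    # If the weights alone exceed device memory, no sequence fits.
    return max(0, int(free_bytes // kv_bytes_per_seq))


for precision in ("fp16", "int8", "int4"):
    n = max_concurrent_sequences(13_000_000_000, precision, 80 * GB, 1 * GB)
    print(f"{precision}: about {n} concurrent sequences")
```

More concurrent sequences do not translate one-to-one into throughput, since compute and memory bandwidth also limit the decode rate, but the estimate shows why smaller weights leave more room for batching.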

For high-traffic deployments such as customer-support bots, search augmentation, and large-scale document processing, throughput determines cost per token and the maximum concurrent user load the infrastructure can sustain without queuing delays. Doubling throughput on the same hardware halves the unit inference cost, which at the scale of billions of daily tokens translates into significant savings in operating expense.
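
The relationship between throughput and unit cost is plain division, sketched below. The hourly hardware price and the token rates are hypothetical inputs, not quoted prices or benchmark results.

```python
def cost_per_million_tokens(hourly_hardware_cost, tokens_per_second):
    """Unit cost of output tokens for hardware billed by the hour."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_hardware_cost / tokens_per_hour * 1_000_000


# Hypothetical node billed at 10 currency units per hour.
baseline = cost_per_million_tokens(10.0, 2_000)
doubled = cost_per_million_tokens(10.0, 4_000)
print(f"baseline: {baseline:.3f} per million tokens")
print(f"doubled throughput: {doubled:.3f} per million tokens")
```

Because the hardware cost is fixed and the token count is in the denominator, doubling the token rate halves the cost per token regardless of the specific price assumed.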

In 2025–2026, optimized open-source serving stacks such as vLLM, SGLang, and TensorRT-LLM have demonstrated throughputs of several thousand output tokens per second per H100 GPU for models in the 7B–70B parameter range, with actual figures depending on model size, precision, and batch size. Cloud providers publish throughput benchmarks under sustained load to help customers size clusters for their traffic patterns. Research on chunked prefill, disaggregated prefill/decode, and speculative decoding continues to push throughput higher while keeping tail latency bounded.

Example

A company running a nightly document-summarization pipeline configures continuous batching on a four-GPU node to sustain 6,000 output tokens per second across 200 concurrent jobs, completing the full queue within a two-hour processing window.
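
The figures in this example imply a token budget for the window and an average rate per job, worked out in the short calculation below. It uses only the numbers stated above; the variable names are illustrative.

```python
tokens_per_second = 6_000      # sustained aggregate output rate
window_seconds = 2 * 60 * 60   # two-hour processing window
concurrent_jobs = 200

# Total output tokens the node can generate within the window.
window_token_budget = tokens_per_second * window_seconds

# Average rate each job receives if the aggregate rate is shared evenly.
tokens_per_second_per_job = tokens_per_second / concurrent_jobs

print(window_token_budget)        # 43200000
print(tokens_per_second_per_job)  # 30.0
```

The queue finishes within the window only if its total output stays under roughly 43.2 million tokens, which is the kind of sizing check throughput figures are used for.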
