Batching

Batching in model inference is the practice of grouping multiple input requests and processing them together in a single forward pass through the model, improving GPU utilization and throughput. Continuous batching extends this by inserting and removing sequences mid-generation rather than waiting for an entire batch to finish.

Batching aggregates multiple inference requests, each a sequence of tokens, into a single tensor that the model processes in one pass. During decoding, which is typically limited by memory bandwidth rather than arithmetic, a GPU can process a batch of n sequences in nearly the same time as a single sequence, because the model weights are read from memory once per step and that cost is shared by every batch member. The benefit holds until compute throughput or memory capacity becomes the bottleneck. Without batching, each request pays the weight-loading cost on its own, leaving the GPU underutilized.
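The amortization argument above can be made concrete with a rough roofline-style estimate. The sketch below is illustrative only: the hardware figures (1 TB/s memory bandwidth, 200 TFLOP/s peak compute) and the model size (14 GB of weights, about 14 GFLOPs per generated token) are assumed round numbers, and the model ignores KV-cache reads, attention cost, and kernel overheads.

```python
def decode_step_time(batch_size,
                     weight_bytes=14e9,       # assumed: ~7B parameters in 16-bit precision
                     mem_bandwidth=1.0e12,    # assumed: 1 TB/s
                     flops_per_token=14e9,    # assumed: ~2 FLOPs per parameter per token
                     peak_flops=200e12):      # assumed: 200 TFLOP/s
    """Estimated seconds for one decode step under a simple roofline model."""
    # Weights are streamed from memory once per step, whatever the batch size.
    load_time = weight_bytes / mem_bandwidth
    # Arithmetic grows linearly with the number of sequences in the batch.
    compute_time = batch_size * flops_per_token / peak_flops
    # The step takes as long as the slower of the two.
    return max(load_time, compute_time)


def tokens_per_second(batch_size, **hardware):
    """Tokens generated per second across the whole batch."""
    return batch_size / decode_step_time(batch_size, **hardware)


if __name__ == "__main__":
    for b in (1, 8, 32, 256):
        print(b, round(decode_step_time(b) * 1e3, 2), "ms/step",
              round(tokens_per_second(b)), "tokens/s")
```

Under these assumptions the step time stays flat while weight loading dominates, so throughput scales almost linearly with batch size, and it only starts to rise once the batch is large enough to make the step compute-bound.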

In static (or synchronous) batching, a fixed group of requests is assembled before generation begins. The batch is released only when every sequence in it has completed, so the slots of fast-finishing requests sit idle until the slowest one is done, and newly arrived requests cannot start in the meantime. Under mixed-length workloads, this head-of-line blocking can hold GPU utilization to roughly 20–40%. Continuous batching, introduced in the Orca research paper (2022) and adopted by vLLM (2023), schedules at the iteration level: at every decode step, completed sequences are evicted and waiting requests are inserted into the freed slots. As long as requests are waiting, this keeps the batch close to full regardless of sequence-length variance, and in practice it can raise utilization to 70–90% or more.
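The difference between the two policies can be seen in a toy simulation. This is a minimal sketch, not how Orca or vLLM is implemented: each request is reduced to a hypothetical number of decode steps, all requests are assumed to be queued at the start, and prefill cost and KV-cache memory limits are ignored. "Utilization" here means the fraction of batch slots doing useful work.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Return (total_steps, slot_utilization) when each batch runs
    until its longest sequence has finished."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held for as long as its slowest member.
        steps += max(lengths[i:i + batch_size])
    return steps, sum(lengths) / (steps * batch_size)


def continuous_batching(lengths, batch_size):
    """Return (total_steps, slot_utilization) when freed slots are
    refilled from the queue before every decode step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active sequence emits one token,
        # and sequences that just emitted their last token are evicted.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps, sum(lengths) / (steps * batch_size)


if __name__ == "__main__":
    lengths = [2, 500, 3, 4, 5, 400, 2, 6]  # hypothetical output lengths
    print("static:    ", static_batching(lengths, batch_size=4))
    print("continuous:", continuous_batching(lengths, batch_size=4))
```

With these example lengths and four slots, static batching needs 900 decode steps while continuous batching needs 500, because the short requests flow through the slots left free next to the two long ones.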

Batching is the primary mechanism by which serving systems amortize GPU overhead across concurrent users. It is also the main lever available to operators who want to increase throughput without adding hardware. The cost is increased latency for individual requests, because a request must sometimes wait in queue until a batch slot opens — a trade-off that can be tuned by adjusting maximum batch size and scheduling policy.
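The queueing side of this trade-off can be sketched with a small simulation that measures how long requests wait for a slot under different maximum batch sizes. Everything here is an assumption made for illustration: time is counted in decode steps, the scheduler is first come, first served, and the arrival times and lengths are made up. The sketch also leaves out the opposing effect, which is that very large batches eventually make each step slower.

```python
from collections import deque


def mean_queue_wait(arrivals, lengths, max_batch_size):
    """Mean number of decode steps a request spends queued before it
    is given a batch slot, under first-come-first-served admission."""
    pending = deque(sorted(zip(arrivals, lengths)))
    running = []  # remaining decode steps for each active sequence
    waits = []
    step = 0
    while pending or running:
        # Admit requests that have already arrived while slots are free.
        while (pending and pending[0][0] <= step
               and len(running) < max_batch_size):
            arrived, length = pending.popleft()
            waits.append(step - arrived)
            running.append(length)
        # One decode step; finished sequences release their slots.
        running = [r - 1 for r in running if r > 1]
        step += 1
    return sum(waits) / len(waits)


if __name__ == "__main__":
    arrivals = [0, 0, 0, 0, 1, 2]     # hypothetical arrival steps
    lengths = [10, 10, 10, 10, 3, 3]  # hypothetical output lengths
    for max_batch in (2, 4, 6):
        print(max_batch, mean_queue_wait(arrivals, lengths, max_batch))
```

In this example the mean wait falls from 9.5 steps with two slots to zero with six, which is the queueing cost an operator trades against per-step speed and memory when choosing a maximum batch size.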

As of 2026, continuous batching is the default strategy in major open-source LLM runtimes, including vLLM, LMDeploy, SGLang, and MLC-LLM, and is widely used by commercial inference providers. Active research areas include chunked prefill (splitting prompt processing into pieces that are interleaved with decoding to reduce latency spikes), speculative batching, and disaggregated architectures that place prefill and decode workloads on separate hardware pools for finer resource control.

Example

A customer-service platform uses continuous batching in vLLM so that a two-token "yes/no" classification and a 500-token complaint summary are processed on the same GPU simultaneously, preventing short requests from stalling behind long ones and keeping median latency below 800 ms.
