Hugging Face sped up LLM inference by 22% with asynchronous batching
Hugging Face unlocked asynchrony in LLM inference. Instead of the CPU and GPU working one after the other, they now run in parallel: while the GPU computes a batch, the CPU is already preparing the next one.

Hugging Face found a simple way to accelerate token generation in LLMs by 22%: make the CPU and GPU work simultaneously instead of waiting for each other. No model retraining and no new algorithms are required.
What Was the Problem
In typical synchronous batching, the CPU prepares data, the GPU computes on it, and the CPU then waits for the results before preparing the next batch. The GPU ends up idle about 24% of the time while the CPU handles preprocessing. It's like an assembly line where at every step someone is watching the clock: even with the CPU running at full capacity, its preprocessing is slower than the GPU's computation, so the graphics card just sits and waits for new data. With synchronous batching, an 8-billion-parameter model with batch size 32 generated 8K tokens in about 300 seconds, with the GPU active only 76% of the time.
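To make the bottleneck concrete, here is a minimal sketch of such a synchronous loop in PyTorch. The `prepare_batch` and `update_requests` helpers and the Hugging-Face-style `model(...).logits` call are illustrative assumptions, not code from the source:

```python
import torch

def generate_sync(model, requests, num_steps):
    # Synchronous pipeline: each stage waits for the previous one to finish.
    for _ in range(num_steps):
        batch = prepare_batch(requests)                 # CPU work; the GPU is idle here
        batch = {k: v.to("cuda") for k, v in batch.items()}
        with torch.no_grad():
            logits = model(**batch).logits              # GPU work
        next_tokens = logits[:, -1].argmax(-1).cpu()    # .cpu() blocks until the GPU finishes
        update_requests(requests, next_tokens)          # CPU work again; the GPU idles
```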
How Asynchrony Solves the Problem
The idea is extremely simple: run the CPU and GPU in parallel using CUDA streams. Three independent streams are used: one for transferring data to the GPU (H2D), one for the actual computation (Compute), and one for transferring results back (D2H). Work submitted to a stream is only enqueued, so control returns to the CPU immediately instead of blocking. Between the streams sit CUDA events, markers that guarantee the correct ordering:
- The H2D stream records an event once the data has been copied to the GPU
- The Compute stream waits for that event, runs the forward pass, and records its own
- The D2H stream waits for the compute event and copies the results back to the host
While the GPU is computing batch N, the CPU is already preparing batch N+1 in a separate memory buffer; neither waits for the other, as in the sketch below.
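Here is a sketch of that choreography using PyTorch's CUDA stream and event APIs. It is illustrative only: `model_forward`, the tensor shapes, and the pair of pinned host buffers are assumptions, not the actual Transformers implementation.

```python
import torch

h2d_stream = torch.cuda.Stream()      # host-to-device copies
compute_stream = torch.cuda.Stream()  # forward passes
d2h_stream = torch.cuda.Stream()      # device-to-host copies

copied = torch.cuda.Event()
computed = torch.cuda.Event()

# Two pinned host buffers: the CPU fills one for batch N+1 while the GPU reads the other.
host_in = [torch.empty(32, 128, dtype=torch.long, pin_memory=True) for _ in range(2)]
host_out = torch.empty(32, dtype=torch.long, pin_memory=True)
device_in = torch.empty(32, 128, dtype=torch.long, device="cuda")

def submit_step(step, model_forward):
    with torch.cuda.stream(h2d_stream):              # 1) copy the inputs to the GPU
        device_in.copy_(host_in[step % 2], non_blocking=True)
        copied.record(h2d_stream)
    with torch.cuda.stream(compute_stream):          # 2) compute only after the copy is done
        compute_stream.wait_event(copied)
        logits = model_forward(device_in)
        computed.record(compute_stream)
    with torch.cuda.stream(d2h_stream):              # 3) copy results back only after compute
        d2h_stream.wait_event(computed)
        host_out.copy_(logits[:, -1].argmax(-1), non_blocking=True)
    # Every call above only enqueues work, so the CPU returns immediately and can
    # start filling host_in[(step + 1) % 2]. Synchronize d2h_stream (or record
    # another event) before reading host_out on the CPU.
```

The events are what keep the three queues honest: each stream can be enqueued far ahead of real time, and the GPU itself enforces copy-then-compute-then-copy ordering.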
The Main Carry-Over Trick
The tricky part is that when a request spans multiple batches, its output tokens from batch N become the inputs for batch N+1. Hugging Face solved this with placeholders: while batch N+1 is being prepared, zeros are written where the not-yet-generated tokens will go, and once batch N completes, the zeros are overwritten with the real values.
"This allows the CPU to prepare the next batch without waiting for the current one's results,"
Hugging Face explains. Everything is held together with CUDA graphs with a shared memory pool, which even saves video memory.
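A rough illustration of the placeholder idea follows; the names, shapes, and the `batch_n_done` event are assumptions for the sketch, not the library's actual code:

```python
import torch

compute_stream = torch.cuda.Stream()
batch_n_done = torch.cuda.Event()   # assumed to be recorded at the end of batch N's forward pass

# CPU side, no waiting: batch N+1 is assembled now, with 0 meaning "token not known yet".
next_inputs = torch.zeros(32, 1, dtype=torch.long, device="cuda")

def run_next_batch(model_forward, batch_n_tokens):
    """batch_n_tokens: the tokens batch N produced, already resident on the GPU."""
    with torch.cuda.stream(compute_stream):
        compute_stream.wait_event(batch_n_done)        # batch N's outputs now exist on the GPU
        next_inputs.copy_(batch_n_tokens.view(32, 1))  # overwrite the placeholder zeros
        return model_forward(next_inputs)              # batch N+1 runs right behind batch N
```

Because the patching happens on the GPU behind an event, the CPU never has to see batch N's results before it hands off batch N+1.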
The Numbers Speak for Themselves
Before the optimization: the GPU was active 76% of the time, and generating 8K tokens took 300.6 seconds. After: the GPU is active 99.4% of the time, and the same tokens take 234.5 seconds. That is a 22% speedup, close to the theoretical maximum for this architecture: the 24% of time the GPU previously spent idle.
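A quick check of where the 22% comes from, using the figures above:

```python
before, after = 300.6, 234.5          # seconds to generate 8K tokens
speedup = (before - after) / before   # ≈ 0.22, the reported 22%
ceiling = 1 - 0.76                    # 24% GPU idle time before = the theoretical maximum
print(f"{speedup:.1%} faster, against a ceiling of {ceiling:.0%}")
```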
What This Means
This means serious speed improvements don't require more expensive GPUs or model retraining. The code is already in the Transformers library, in the `ContinuousBatchingAsyncIOs` class. For companies generating tokens at scale, this could save millions on hardware.