Hugging Face sped up LLM inference by 22% with asynchronous batching
Hugging Face unlocked asynchrony in LLM inference. Instead of the CPU and GPU working one after the other, they now run in parallel: while the GPU computes a batch, the CPU is already preparing the next one.

Hugging Face found a simple way to accelerate token generation in LLMs by 22%: make the CPU and GPU work simultaneously instead of waiting for each other. No model retraining and no new algorithms are required.
What Was the Problem
In typical synchronous batching, the CPU prepares data, the GPU computes on it, and the CPU then waits for the results before preparing the next batch. The GPU ends up idle about 24% of the time while the CPU handles preprocessing. It's like an assembly line where at every step someone is watching the clock: even with the CPU running at full capacity, its preprocessing is slower than the GPU's computation, so the graphics card just sits and waits for new data. With synchronous batching, an 8-billion-parameter model with batch size 32 generated 8K tokens in about 300 seconds, with the GPU active only 76% of the time.
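To make the bottleneck concrete, here is a minimal sketch of such a synchronous loop in PyTorch. The `prepare_batch` and `update_requests` helpers and the Hugging-Face-style `model(...).logits` call are illustrative assumptions, not code from the source:

```python
import torch

def generate_sync(model, requests, num_steps):
    # Synchronous pipeline: each stage waits for the previous one to finish.
    for _ in range(num_steps):
        batch = prepare_batch(requests)                 # CPU work; the GPU is idle here
        batch = {k: v.to("cuda") for k, v in batch.items()}
        with torch.no_grad():
            logits = model(**batch).logits              # GPU work
        next_tokens = logits[:, -1].argmax(-1).cpu()    # .cpu() blocks until the GPU finishes
        update_requests(requests, next_tokens)          # CPU work again; the GPU idles
```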
How Asynchrony Solves the Problem
The idea is extremely simple: run the CPU and GPU in parallel using CUDA streams. Three independent streams are used: one for transferring data to the GPU (H2D), one for the actual computation (Compute), and one for transferring results back (D2H). Work submitted to a stream is only enqueued, so control returns to the CPU immediately instead of blocking. Between the streams sit CUDA events, markers that guarantee the correct ordering:
- The H2D stream records an event once the data has been copied to the GPU
- The Compute stream waits for that event, runs the forward pass, and records its own
- The D2H stream waits for the compute event and copies the results back to the host
While the GPU is computing batch N, the CPU is already preparing batch N+1 in a separate memory buffer; neither waits for the other, as in the sketch below.
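Here is a sketch of that choreography using PyTorch's CUDA stream and event APIs. It is illustrative only: `model_forward`, the tensor shapes, and the pair of pinned host buffers are assumptions, not the actual Transformers implementation.

```python
import torch

h2d_stream = torch.cuda.Stream()      # host-to-device copies
compute_stream = torch.cuda.Stream()  # forward passes
d2h_stream = torch.cuda.Stream()      # device-to-host copies

copied = torch.cuda.Event()
computed = torch.cuda.Event()

# Two pinned host buffers: the CPU fills one for batch N+1 while the GPU reads the other.
host_in = [torch.empty(32, 128, dtype=torch.long, pin_memory=True) for _ in range(2)]
host_out = torch.empty(32, dtype=torch.long, pin_memory=True)
device_in = torch.empty(32, 128, dtype=torch.long, device="cuda")

def submit_step(step, model_forward):
    with torch.cuda.stream(h2d_stream):              # 1) copy the inputs to the GPU
        device_in.copy_(host_in[step % 2], non_blocking=True)
        copied.record(h2d_stream)
    with torch.cuda.stream(compute_stream):          # 2) compute only after the copy is done
        compute_stream.wait_event(copied)
        logits = model_forward(device_in)
        computed.record(compute_stream)
    with torch.cuda.stream(d2h_stream):              # 3) copy results back only after compute
        d2h_stream.wait_event(computed)
        host_out.copy_(logits[:, -1].argmax(-1), non_blocking=True)
    # Every call above only enqueues work, so the CPU returns immediately and can
    # start filling host_in[(step + 1) % 2]. Synchronize d2h_stream (or record
    # another event) before reading host_out on the CPU.
```

The events are what keep the three queues honest: each stream can be enqueued far ahead of real time, and the GPU itself enforces copy-then-compute-then-copy ordering.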
The Main Carry-Over Trick
The tricky part is that when a request spans multiple batches, its output tokens from batch N become the inputs for batch N+1. Hugging Face solved this with placeholders: while batch N+1 is being prepared, zeros are written where the not-yet-generated tokens will go, and once batch N completes, the zeros are overwritten with the real values.
"This allows the CPU to prepare the next batch without waiting for the current one's results,"
Hugging Face explains. Everything is held together with CUDA graphs with a shared memory pool, which even saves video memory.
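A rough illustration of the placeholder idea follows; the names, shapes, and the `batch_n_done` event are assumptions for the sketch, not the library's actual code:

```python
import torch

compute_stream = torch.cuda.Stream()
batch_n_done = torch.cuda.Event()   # assumed to be recorded at the end of batch N's forward pass

# CPU side, no waiting: batch N+1 is assembled now, with 0 meaning "token not known yet".
next_inputs = torch.zeros(32, 1, dtype=torch.long, device="cuda")

def run_next_batch(model_forward, batch_n_tokens):
    """batch_n_tokens: the tokens batch N produced, already resident on the GPU."""
    with torch.cuda.stream(compute_stream):
        compute_stream.wait_event(batch_n_done)        # batch N's outputs now exist on the GPU
        next_inputs.copy_(batch_n_tokens.view(32, 1))  # overwrite the placeholder zeros
        return model_forward(next_inputs)              # batch N+1 runs right behind batch N
```

Because the patching happens on the GPU behind an event, the CPU never has to see batch N's results before it hands off batch N+1.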
The Numbers Speak for Themselves
Before the optimization: the GPU was active 76% of the time, and generating 8K tokens took 300.6 seconds. After: the GPU is active 99.4% of the time, and the same tokens take 234.5 seconds. That is a 22% speedup, close to the theoretical maximum for this architecture: the 24% of time the GPU previously spent idle.
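A quick check of where the 22% comes from, using the figures above:

```python
before, after = 300.6, 234.5          # seconds to generate 8K tokens
speedup = (before - after) / before   # ≈ 0.22, the reported 22%
ceiling = 1 - 0.76                    # 24% GPU idle time before = the theoretical maximum
print(f"{speedup:.1%} faster, against a ceiling of {ceiling:.0%}")
```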
What This Means
This means serious speed improvements don't require more expensive GPUs or model retraining. The code is already in the Transformers library, in the `ContinuousBatchingAsyncIOs` class. For companies generating tokens at scale, this could save millions on hardware.