AWS and vLLM integrated P-EAGLE to speed up LLM inference by up to 1.69× over EAGLE-3
AI-processed from AWS Machine Learning Blog; edited by Hamidun News
AWS and the vLLM team have presented P-EAGLE, a new way to accelerate large language model inference without changing the base model. The approach is already integrated into vLLM, and in tests with GPT-OSS 20B it achieved up to 1.69× higher throughput than standard EAGLE-3.
Where the bottleneck was
Speculative decoding has long been considered one of the most practical ways to speed up LLMs during inference. The idea is that an auxiliary model suggests several next tokens in advance, while the main model quickly verifies which ones can be accepted. The EAGLE method already provided a noticeable improvement and was used in vLLM, SGLang, and TensorRT-LLM.
But it had one important problem: to generate K draft tokens, the drafter model had to make K sequential forward passes. The deeper the speculation, the more the drafter's own latency grew. Because of this, classic EAGLE hit a hidden ceiling.
On paper, it pays to speculate deeper and accept more tokens per round, but in practice the drafter's additional work started to eat away the benefit. The authors of P-EAGLE remove exactly this limitation: all K draft tokens are generated in a single forward pass. This shifts the balance toward more aggressive speculation, especially on long responses and code tasks, where every extra sequential operation shows up in latency and throughput.
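The trade-off above can be illustrated with a toy cost model. This is a minimal sketch, not a measurement: the acceptance probability and the per-pass costs are made-up illustrative values, and the model assumes each draft token is accepted independently, that a parallel drafter pass costs the same as one sequential pass, and that verification cost does not depend on K.

```python
# Toy model: expected tokens per round divided by the time one round takes.
# All numeric values below are illustrative assumptions, not benchmark data.

def expected_tokens_per_round(accept_prob: float, depth: int) -> float:
    """Expected tokens emitted per round when each draft token is accepted
    independently with probability accept_prob (the i = 0 term is the token
    the target model always contributes)."""
    return sum(accept_prob ** i for i in range(depth + 1))


def throughput(accept_prob: float, depth: int, target_cost: float,
               drafter_cost: float, parallel: bool) -> float:
    """Tokens per unit time for one speculate-and-verify round."""
    drafter_passes = 1 if parallel else depth  # K sequential passes vs one pass
    round_time = target_cost + drafter_passes * drafter_cost
    return expected_tokens_per_round(accept_prob, depth) / round_time


ACCEPT_PROB = 0.7    # hypothetical per-token acceptance probability
TARGET_COST = 1.0    # hypothetical cost of one target (verifier) pass
DRAFTER_COST = 0.2   # hypothetical cost of one drafter pass
DEPTHS = range(1, 11)

sequential = {k: throughput(ACCEPT_PROB, k, TARGET_COST, DRAFTER_COST, False)
              for k in DEPTHS}
single_pass = {k: throughput(ACCEPT_PROB, k, TARGET_COST, DRAFTER_COST, True)
               for k in DEPTHS}

best_sequential_depth = max(sequential, key=sequential.get)
best_single_pass_depth = max(single_pass, key=single_pass.get)

for k in DEPTHS:
    print(f"K={k:2d}  sequential={sequential[k]:.3f}  single-pass={single_pass[k]:.3f}")
```

With these values, the sequential drafter peaks at a shallow depth and then declines, while the single-pass drafter keeps benefiting from deeper speculation. A real system would also see verification cost and acceptance rates change with K, which this sketch ignores.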
How P-EAGLE works
The P-EAGLE architecture has two stages. First, the target model processes the prompt and, as usual, predicts the next token. At the same time, the system saves the internal hidden states for the prompt positions and for the new token.
Then the drafter assembles inputs for all future positions in parallel: for the parts of the sequence that are already known, it uses the real embeddings and hidden states, while for positions that do not exist yet it substitutes learnable masks and a shared hidden vector. Several future tokens are then predicted in a single forward pass rather than in a chain of steps. Training on long sequences is a separate challenge.
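Before turning to training, the input assembly described above can be sketched in plain Python. This is an illustration only, not the P-EAGLE implementation: the names `DrafterInput` and `build_parallel_inputs` are hypothetical, vectors are plain lists instead of tensors, and a single mask embedding is assumed to be reused for every future position.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class DrafterInput:
    position: int
    embedding: Vector
    hidden: Vector
    is_placeholder: bool


def build_parallel_inputs(token_embeddings: List[Vector],
                          target_hidden: List[Vector],
                          mask_embedding: Vector,
                          shared_hidden: Vector,
                          num_future: int) -> List[DrafterInput]:
    """Assemble every drafter input up front so one forward pass can cover
    all positions (hypothetical helper for illustration)."""
    if len(token_embeddings) != len(target_hidden):
        raise ValueError("need one hidden state per known token")

    # Known positions: real token embeddings and hidden states saved
    # from the target model.
    inputs = [DrafterInput(i, emb, hid, False)
              for i, (emb, hid) in enumerate(zip(token_embeddings, target_hidden))]

    # Future positions: no token exists yet, so substitute the learnable
    # mask embedding and the shared hidden vector.
    known = len(inputs)
    for offset in range(num_future):
        inputs.append(DrafterInput(known + offset, mask_embedding, shared_hidden, True))
    return inputs


# Toy values standing in for real model tensors.
token_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
target_hidden = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]]
mask_embedding = [0.0, 0.0]   # would be a trained parameter
shared_hidden = [9.0, 9.0]    # would be a trained parameter

drafter_inputs = build_parallel_inputs(token_embeddings, target_hidden,
                                       mask_embedding, shared_hidden, num_future=4)
for item in drafter_inputs:
    kind = "placeholder" if item.is_placeholder else "real"
    print(item.position, kind, item.embedding, item.hidden)
```

Because every input is available before the drafter runs, nothing in this layout forces the future positions to be processed one after another.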
AWS notes that for GPT-OSS 120B on UltraChat, the median sequence length (prompt plus generation) reached 3891 tokens, and the 90th percentile reached 10800 tokens. With parallel draft decoding, memory grows very quickly, because the number of positions becomes N × K. To handle this, the authors added a sequence partition algorithm: it divides one long sequence into contiguous chunks, preserves the correct attention dependencies between them, and allows gradients to be accumulated within a single example, not only across batches.
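A rough sketch of the arithmetic and of the chunking idea follows. It does not reproduce the authors' algorithm: `partition_sequence`, the chunk size of 2048, and the use of K=7 as the training depth are assumptions made only for illustration, and N is read here as the sequence length in tokens.

```python
from typing import List, Tuple


def expanded_positions(seq_len: int, depth: int) -> int:
    """Number of positions when each of N tokens is expanded to K draft slots."""
    return seq_len * depth


def partition_sequence(seq_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split [0, seq_len) into contiguous half-open chunks (start, end).
    Hypothetical helper; the real partition algorithm is not shown in the article."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, seq_len))
            for start in range(0, seq_len, chunk_size)]


DEPTH = 7            # assumption: reuse the K=7 depth mentioned in the benchmarks
CHUNK_SIZE = 2048    # assumption: arbitrary illustrative chunk size

median_positions = expanded_positions(3891, DEPTH)
p90_positions = expanded_positions(10800, DEPTH)
print("median sequence:", median_positions, "positions")
print("90th percentile:", p90_positions, "positions")

chunks = partition_sequence(10800, CHUNK_SIZE)
for start, end in chunks:
    # To keep attention dependencies intact, a chunk still needs to see
    # everything before it, i.e. the context [0, end).
    print(f"chunk [{start}, {end}) attends to context [0, {end})")
```

In a training loop, the loss would be computed chunk by chunk and the gradients summed before a single optimizer step, so the whole example contributes one update without all N × K positions being held in memory at once.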
Integration and numbers
The work did not stop at the paper: P-EAGLE is available in vLLM starting with version 0.16.0. To enable it, configure speculative decoding with the parallel_drafting: true flag and point it at a compatible drafter head.
AWS has already released ready-made checkpoints for GPT-OSS 120B, GPT-OSS 20B, and Qwen3-Coder 30B, so the technology can be tried without training from scratch.
- Integration appeared in vLLM starting with version 0.16.0
- The mode is enabled via the parallel_drafting: true flag
- Ready-made P-EAGLE drafter heads are available for GPT-OSS 120B, GPT-OSS 20B, and Qwen3-Coder 30B
- On NVIDIA B200, the improvement over standard EAGLE-3 ranged from 1.05× to 1.69×
- The best P-EAGLE throughput in tests was achieved at speculation depth K=7
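The settings listed above might be put together as follows. This is a hedged sketch: only the parallel_drafting flag and the K=7 depth come from the article. The `--speculative-config` option and the `method`, `model`, and `num_speculative_tokens` keys are vLLM's usual speculative decoding settings, the `eagle3` method value is an assumption for P-EAGLE heads, and the drafter path is a placeholder rather than a real checkpoint name.

```python
import json
import shlex

TARGET_MODEL = "openai/gpt-oss-20b"
# Placeholder: substitute the P-EAGLE drafter checkpoint you actually use.
DRAFTER_CHECKPOINT = "path/to/p-eagle-drafter"

speculative_config = {
    "method": "eagle3",              # assumption: P-EAGLE heads load via the EAGLE-3 method
    "model": DRAFTER_CHECKPOINT,
    "num_speculative_tokens": 7,     # K=7 gave the best throughput in the reported tests
    "parallel_drafting": True,       # the flag named in the article
}

command = [
    "vllm", "serve", TARGET_MODEL,
    "--speculative-config", json.dumps(speculative_config),
]

# Print a shell-safe command line that can be copied into a terminal.
print(shlex.join(command))
```

Building the JSON with `json.dumps` and quoting with `shlex.join` avoids hand-escaping the nested quotes. Check the vLLM documentation for your installed version before relying on any of the key names.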
The benchmark picture looks consistent. On MT-Bench, HumanEval, and SPEED-Bench, the new method showed an improvement of 55–69% under light load and kept a gain of 5–25% even under high load. Besides speed, acceptance length also improved, that is, the average number of draft tokens the verifier accepts per round. For example, at K=7 P-EAGLE reached 3.94 versus 3.03 for EAGLE-3 on HumanEval, and 3.38 versus 2.59 on SPEED-Bench. AWS specifically notes that running GPT-OSS 20B with an EAGLE drafter currently requires a one-line patch to vLLM, which should be included in one of the next releases.
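To put the acceptance-length figures in relative terms, the short calculation below uses only the four numbers quoted above; the helper name `relative_gain` is ours.

```python
# Acceptance lengths at K=7, as quoted in the article.
acceptance_length = {
    "HumanEval": {"P-EAGLE": 3.94, "EAGLE-3": 3.03},
    "SPEED-Bench": {"P-EAGLE": 3.38, "EAGLE-3": 2.59},
}


def relative_gain(new: float, old: float) -> float:
    """Percentage improvement of new over old."""
    return (new / old - 1.0) * 100.0


gains = {
    bench: relative_gain(values["P-EAGLE"], values["EAGLE-3"])
    for bench, values in acceptance_length.items()
}

for bench, gain in gains.items():
    print(f"{bench}: +{gain:.1f}% accepted draft tokens per round")
```

Both benchmarks come out at roughly 30% more accepted draft tokens per round, so part of the throughput gain comes from better drafts and not only from the cheaper single drafter pass.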
What this means
For teams already running vLLM in production, P-EAGLE is a rare improvement that does not require overhauling the stack: the new scheme is built into the familiar runtime and is activated by a config change plus a compatible checkpoint. If the ecosystem quickly gets more parallel-trained drafter models, this variant of speculative decoding could become the new standard for fast and cheap LLM inference.