Inference

Speculative Decoding

Speculative decoding is an inference technique in which a small draft model proposes several tokens ahead and the large target model verifies them all in a single forward pass, typically speeding up generation by 2–4× without changing the output distribution.

Speculative decoding is a method for accelerating autoregressive language model inference. In standard generation, a large model produces one token at a time, and each token requires its own full forward pass. Speculative decoding eases this bottleneck by pairing a fast draft model with the target model. It exploits the fact that a transformer can score every position of an already-known token sequence in one parallel forward pass, even though it must generate new tokens one at a time.

The mechanism works in two stages. First, a lightweight draft model (for example, a 7B-parameter model drafting for a 70B target) generates K candidate tokens in K cheap sequential passes. Second, the large target model evaluates all K+1 positions simultaneously in one forward pass and checks each proposed token against its own distribution. Tokens are accepted left to right; at the first rejection, the remaining draft tokens are discarded and a replacement token is sampled from the target model's corrected distribution. If all K tokens are accepted, the same pass supplies one additional token from the target. Crucially, the resulting output distribution is provably identical to what the large model would have produced on its own, so output quality is unchanged.
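The accept-or-reject step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the distributions are plain lists of probabilities over a toy three-token vocabulary, the helper names `sample` and `verify` are hypothetical, and the acceptance rule used (accept with probability min(1, p/q), otherwise resample from the renormalised positive part of p − q) is the standard speculative sampling rule rather than anything specific to one library.

```python
import random

def sample(dist, rng):
    """Draw a token index from a discrete distribution (list of weights)."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def verify(draft_tokens, draft_dists, target_dists, rng):
    """Check K drafted tokens against the target model's distributions.

    draft_dists[i]  : draft distribution q_i that produced draft_tokens[i]
    target_dists[i] : target distribution p_i at the same position; this list
                      has K+1 entries, the last one being for the bonus token.
    Returns the tokens to append to the generated sequence.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_dists[i], draft_dists[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # First rejection: sample a replacement from max(0, p - q)
        # (rng.choices renormalises the weights) and drop the rest of the draft.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # Every drafted token was accepted: take one bonus token from the target.
    out.append(sample(target_dists[len(draft_tokens)], rng))
    return out

if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.6, 0.3, 0.1]  # toy draft distribution (made-up numbers)
    p = [0.2, 0.5, 0.3]  # toy target distribution (made-up numbers)
    drafted = [sample(q, rng) for _ in range(3)]
    print(verify(drafted, [q] * 3, [p] * 4, rng))
```

In a real system the distributions depend on the context at each position and come from the two models' logits; they are held fixed here only to keep the sketch short. The resampling step is what keeps the output distribution equal to the target's: tokens the draft over-proposes are rejected in proportion, and the replacement covers the probability mass the draft under-proposed.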

The technique matters because it cuts decoding latency by roughly 2–4× on typical hardware with no accuracy trade-off. This is especially valuable for interactive applications, where streaming speed directly affects user experience. The gain applies to tokens generated after the first one; time-to-first-token is governed by prompt processing and is not reduced. The speedup depends on the draft acceptance rate: a well-matched draft that agrees with the target on most tokens yields the largest gains, and the benefit diminishes when the two models diverge significantly in style or domain.
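The link between acceptance rate and speedup can be made concrete with a rough estimate. The sketch below rests on simplifying assumptions that are not stated in the entry itself: each drafted token is accepted independently with the same probability `alpha`, one target pass over K+1 positions costs about the same as one ordinary decoding step, and one draft pass costs a fixed fraction `cost_ratio` of a target pass. The function names and the example values are illustrative only.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha; the result is 1 + alpha + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain one-token-per-pass decoding.

    cost_ratio is the assumed cost of one draft pass relative to one target
    pass, so one speculative step costs k * cost_ratio + 1 target passes.
    """
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1.0)

if __name__ == "__main__":
    # Hypothetical settings: 4 drafted tokens, draft pass at 10% of target cost.
    for alpha in (0.5, 0.75, 0.9):
        print(alpha, round(estimated_speedup(alpha, k=4, cost_ratio=0.1), 2))
```

Under this model a draft that is rarely accepted can make generation slower than plain decoding, because the draft passes are paid for on every step whether or not their tokens survive verification.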

By 2025–2026, speculative decoding is deployed in production by Google for Gemini inference, Anthropic for Claude, and major inference providers including Together AI and Groq. Variants such as Medusa (multiple parallel draft heads attached to a single model), EAGLE (a trained speculative head using feature-level inputs from the target), and self-speculative decoding (using earlier transformer layers as the draft) have extended the technique's applicability and reduced the need for a separately trained draft model.

Example

A production API serving a 70B-parameter model co-locates a 7B draft model; at a 75% token acceptance rate, streaming latency drops from roughly 120 ms to under 45 ms per decoded token without any change to the model's outputs.

Related terms

Inference, Latency, Small Language Model (SLM), Token

Latest news on the topic

NVIDIA accelerates inference on Blackwell by up to 15× with DFlash Speculative Decoding (2026-06-28)
NVIDIA introduced SPEED-Bench, a unified benchmark for speculative decoding (2026-05-02)
