
NVIDIA introduced SPEED-Bench — a unified benchmark for speculative decoding


AI-processed from Hugging Face Blog; edited by Hamidun News

NVIDIA has published SPEED-Bench on Hugging Face. It is a new benchmark for speculative decoding, a technique for accelerating large language model inference. Its goal is to measure how models and inference engines behave on workloads close to real-world operation, rather than peak performance under laboratory conditions.

How SPEED-Bench works

The authors start from a simple problem: existing tests are fragmented. Some assess draft model quality on samples that are too small, others measure throughput only on short prompts at batch size 1, and still others depend on a specific stack that reflects production poorly. As a result, comparing speculative decoding methods is difficult: the same algorithm can look excellent on a toy dataset and noticeably worse on long contexts or under high request concurrency.

SPEED-Bench consists of two data splits plus a unified measurement framework. The qualitative split contains 880 prompts from 18 public sources, spread across 11 categories that range from coding and math to roleplay, RAG, summarization, and multilingual tasks. Each category holds 80 examples (11 × 80 = 880), chosen to reduce semantic duplication and cover as many distinct scenarios as possible. To select them, the authors embedded the candidates with the text-embedding-3-small model and minimized the average pairwise similarity within each category.
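The selection step can be illustrated with a minimal sketch. It assumes a simple greedy heuristic and random vectors in place of real text-embedding-3-small embeddings; the article does not say which optimization procedure the authors used, so `select_diverse` and `mean_pairwise_similarity` are hypothetical helpers, not SPEED-Bench code.

```python
import numpy as np


def _cosine_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


def select_diverse(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily choose k row indices with low summed similarity to each other."""
    if not 0 < k <= len(embeddings):
        raise ValueError("k must be between 1 and the number of candidates")
    sim = _cosine_matrix(embeddings)
    # Seed with the candidate least similar to the whole pool on average.
    chosen = [int(np.argmin(sim.mean(axis=1)))]
    # cost[i] = summed similarity between candidate i and the chosen set.
    cost = sim[chosen[0]].copy()
    while len(chosen) < k:
        cost[chosen] = np.inf  # already-chosen rows can never win argmin
        nxt = int(np.argmin(cost))
        chosen.append(nxt)
        cost += sim[nxt]
    return chosen


def mean_pairwise_similarity(embeddings: np.ndarray, idx: list[int]) -> float:
    """Average cosine similarity over all distinct pairs in `idx`."""
    sim = _cosine_matrix(embeddings[idx])
    n = len(idx)
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))


# Stand-in data (assumption): 400 random 32-dimensional "embeddings"
# playing the role of one category's candidate prompts.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(400, 32))
picked = select_diverse(candidates, 80)
print(len(picked), mean_pairwise_similarity(candidates, picked))
```

A greedy pass like this is cheap and usually good enough for illustration. An exact minimization over all 80-element subsets would be combinatorial.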

  • The qualitative split measures acceptance rate and acceptance length across different domains
  • The throughput split measures speed on input sequences from 1k to 32k tokens
  • Each input length comes in three difficulty levels: low-, mixed-, and high-entropy
  • A single bucket contains 1,536 prompts, enough to build stable throughput curves at batch sizes up to 512
  • The framework works with TensorRT-LLM, vLLM, and SGLang
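The two metrics named in the first bullet can be sketched as follows. The definitions are assumptions, since conventions vary between papers: here acceptance length is the mean number of tokens emitted per verification step (accepted draft tokens plus the one token the target model always contributes), and acceptance rate is accepted draft tokens divided by proposed draft tokens. `acceptance_stats` is a hypothetical helper, not part of the SPEED-Bench framework.

```python
from dataclasses import dataclass


@dataclass
class AcceptanceStats:
    acceptance_length: float  # mean tokens emitted per verification step
    acceptance_rate: float    # accepted draft tokens / proposed draft tokens


def acceptance_stats(accepted_per_step: list[int], draft_len: int) -> AcceptanceStats:
    """Summarise a decoding run from the accepted-token count of each step."""
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    if any(a < 0 or a > draft_len for a in accepted_per_step):
        raise ValueError("accepted counts must lie in [0, draft_len]")
    steps = len(accepted_per_step)
    accepted = sum(accepted_per_step)
    return AcceptanceStats(
        # Each step emits its accepted draft tokens plus one target token.
        acceptance_length=(accepted + steps) / steps,
        acceptance_rate=accepted / (steps * draft_len),
    )


# Four verification steps with a draft length of 3 tokens.
stats = acceptance_stats([3, 0, 1, 2], draft_len=3)
print(stats)
```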

Fair comparison between engines is addressed separately. Different inference systems handle chat templates, BOS tokens, and tokenization differently, so the same model may receive slightly different inputs. In SPEED-Bench, prompt preparation is moved outside the engine: engines receive sequences that are already tokenized. This reduces the impact of implementation-specific differences and makes it possible to compare the speculative decoding algorithms themselves rather than preprocessing side effects. The framework also records detailed telemetry for step latency, user TPS, and overall output throughput.
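The difference between per-user speed and aggregate throughput is easy to blur, so here is a small sketch. The definitions are assumptions for illustration (user TPS as the mean per-request generation rate, output throughput as total generated tokens over the wall-clock window); `RequestTrace` and both functions are hypothetical and not the framework's API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    start_s: float      # time the request began decoding
    end_s: float        # time its last token was produced
    output_tokens: int  # tokens generated for this request


def user_tps(traces: list[RequestTrace]) -> float:
    """Mean tokens per second as seen by an individual request."""
    return mean(t.output_tokens / (t.end_s - t.start_s) for t in traces)


def output_throughput(traces: list[RequestTrace]) -> float:
    """Total generated tokens per second across the whole run."""
    window = max(t.end_s for t in traces) - min(t.start_s for t in traces)
    return sum(t.output_tokens for t in traces) / window


# Two requests served concurrently over the same two seconds.
traces = [RequestTrace(0.0, 2.0, 100), RequestTrace(0.0, 2.0, 100)]
print(user_tps(traces), output_throughput(traces))
```

With concurrent requests the aggregate figure exceeds what any single user sees, which is why the two numbers are worth reporting separately.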

What the tests showed

The first results show that speculative decoding depends heavily on the task type. In low-entropy domains such as coding and math, acceptance length is higher because the drafter can predict the next tokens more easily. In more open-ended tasks such as roleplay and writing, the metrics are lower. In the examples from the paper, the native MTP heads in Qwen3-Next deliver an average acceptance length of 2.81, EAGLE3 on GPT-OSS 120B reaches 2.25, and N-Gram on Llama 3.3 70B reaches 1.41. At batch size 32, N-Gram averages 0.88x, which is a net slowdown rather than a speedup.
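Why an acceptance length above 1 can still mean a slowdown is worth a back-of-the-envelope sketch. It assumes a deliberately simplified model in which speedup equals acceptance length divided by the relative cost of one speculative step (drafting plus verification) compared with one ordinary decode step. Pairing the 1.41 and 0.88x figures as if they came from the same configuration is also an assumption made only for illustration.

```python
def speedup(acceptance_length: float, relative_step_cost: float) -> float:
    """Simplified model: tokens gained per step over the extra cost per step."""
    return acceptance_length / relative_step_cost


def implied_step_cost(acceptance_length: float, observed_speedup: float) -> float:
    """Invert the model: how expensive must a step be to explain the result?"""
    return acceptance_length / observed_speedup


# Under this model, a step costing 1.25x a normal decode step with an
# acceptance length of 2.0 would give a 1.6x speedup.
print(speedup(2.0, 1.25))

# N-Gram figures from the article, paired here by assumption.
print(round(implied_step_cost(1.41, 0.88), 2))
```

Under this model the break-even point is a relative step cost equal to the acceptance length. Any step cost above that turns speculation into a loss, which becomes more likely at larger batch sizes where verification is no longer nearly free.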

Another conclusion concerns aggressive optimizations. The authors separately examine vocabulary pruning in EAGLE3, a technique that shrinks the draft model's output vocabulary to reduce the cost of the final projection. In coding and math its effect is almost unnoticeable, but on the long tail of user requests, especially multilingual, RAG, and summarization tasks, acceptance length declines more sharply. In other words, an optimization that looks harmless on a narrow dataset can worsen real-world behavior across a broader set of tasks.
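A toy sketch of the mechanism: if the drafter can only emit tokens from a reduced vocabulary, any target token outside it cannot be proposed, so acceptance stops at that position. The example below uses whitespace-split words as stand-in tokens and a frequency-based pruning rule. Both are assumptions for illustration and not EAGLE3's actual procedure, and `build_pruned_vocab` and `coverage` are hypothetical helpers.

```python
from collections import Counter


def build_pruned_vocab(calibration_tokens: list[str], keep: int) -> set[str]:
    """Keep only the `keep` most frequent tokens seen in calibration data."""
    counts = Counter(calibration_tokens)
    return {token for token, _ in counts.most_common(keep)}


def coverage(tokens: list[str], vocab: set[str]) -> float:
    """Fraction of tokens the pruned drafter would be able to propose at all."""
    return sum(token in vocab for token in tokens) / len(tokens)


# Calibrate the pruned vocabulary on code-like text only.
calibration = "def f ( x ) : return x + 1 def g ( y ) : return y * 2".split()
vocab = build_pruned_vocab(calibration, keep=8)

code_eval = "def h ( x ) : return x".split()
multilingual_eval = "résumez ce texte en deux phrases".split()

print(coverage(code_eval, vocab), coverage(multilingual_eval, vocab))
```

Coverage stays high on text that resembles the calibration data and collapses on the multilingual sample, which mirrors the direction of the effect described above.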

The most practical observation concerns synthetic workloads. It is still common in the industry to benchmark inference on random tokens, but for speculative decoding this mode distorts the picture. The model recognizes the noise and responds in a templated, highly predictable way, which artificially inflates acceptance length. In SPEED-Bench measurements, this overstates throughput by about 23% compared with realistic workloads. For teams, this is a direct signal: synthetic benchmarks can lead to the wrong choice of draft length or even of the entire acceleration scheme.

What it means

SPEED-Bench is an attempt to bring speculative decoding evaluation closer to what really matters for teams running LLMs in production: long contexts, large batch sizes, diverse domains, and comparable conditions across engines. If the benchmark gains adoption, the discussion around LLM acceleration will shift from pretty numbers on synthetic tests to reproducible data showing exactly where acceleration works and where it does not. For infra and research teams, that is more useful than yet another record on a single convenient dataset.

