AWS shows how speculative decoding on Trainium2 accelerates generation in vLLM

AWS demonstrated how speculative decoding on Trainium2 can significantly reduce generation costs in LLMs when workloads are bottlenecked by long outputs.

AI-processed from AWS Machine Learning Blog; edited by Hamidun News

AWS showed a practical way to accelerate LLM inference on Trainium2 and reduce its cost in scenarios where the model generates significantly more tokens than it receives as input. The technique is speculative decoding: instead of forcing a large model to output one token at a time, the system pairs it with a small draft-model that quickly proposes several next tokens at once, while the main target-model verifies them in a single pass. When the proposals match, the service spends fewer expensive sequential steps, reduces latency between tokens, and makes better use of the accelerator.
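
The draft-then-verify loop described above can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not the NeuronX or vLLM implementation: both "models" are hypothetical Python functions that return a greedy next token, verification uses exact greedy matching, and the target-model is called in a loop where a real system would use one batched forward pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the tokens accepted in it."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target model checks each proposed position. A real system does
    #    this in a single batched pass; the loop here is only for clarity.
    accepted: List[int] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)

    # 3. All k proposals matched, so the same pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int) -> Tuple[List[int], int]:
    """Return (prompt plus generated tokens, number of verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
        passes += 1
    return out[: len(prompt) + max_new_tokens], passes


# Toy models (assumptions for illustration): the target counts modulo 10,
# and the draft agrees with it except right after the token 2.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 9 if seq[-1] == 2 else (seq[-1] + 1) % 10
```

With greedy matching, the output is identical to what the target-model would generate on its own; only the number of target passes changes. Calling `generate(toy_target, toy_draft, [0], max_new_tokens=8, k=4)` produces the same eight tokens as plain decoding while using fewer target passes than tokens.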

This is especially important for decode-heavy workloads: writing assistants, coding agents, report generation, templated documents, and other tasks with long responses. In standard autoregressive generation, each new token is computed separately, so the accelerator repeatedly reads the model weights and the KV-cache from memory and performs relatively little useful work per step. Because of this, inference is often limited by memory bandwidth rather than by raw compute.

Speculative decoding targets exactly this bottleneck: the target-model executes sequential decode steps less frequently, and batch verification makes the workload denser. However, the approach has requirements. Draft and target models must use the same tokenizer and vocabulary, and should ideally belong to the same architectural family so that the small model more often predicts the same continuation as the main one.

A key parameter is the number of speculative tokens. If the window is too small, the gain is barely noticeable; if too large, early rejections and unnecessary verification consume the benefit. In their test, AWS used the target-model Qwen3-32B and the draft-model Qwen3-1.7B, running through vLLM on a trn2.48xlarge instance. For speculative decoding, they chose fused speculation in NeuronX Distributed Inference, where both models are compiled together for better performance.

The baseline and speculative configurations were deployed in a single Amazon EKS cluster with everything kept identical: accelerator allocation, tensor parallelism, context length, batch limits, and Neuron image. The only difference was the addition of the draft-model and the num_speculative_tokens parameter. Load was applied to both services via llmperf, and TTFT, inter-token latency, and end-to-end latency were sent to CloudWatch for comparison.
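
The three latency metrics compared here can be derived from per-token arrival times. The sketch below uses one common set of definitions; it is an assumption that these match what llmperf reports, and the function and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LatencyMetrics:
    ttft_s: float                 # time to first token
    inter_token_latency_s: float  # mean gap between consecutive tokens
    end_to_end_s: float           # request sent until last token received


def latency_metrics(request_sent_s: float, token_times_s: List[float]) -> LatencyMetrics:
    """Compute latency metrics from a request timestamp and token arrival times."""
    if not token_times_s:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times_s[0] - request_sent_s
    end_to_end = token_times_s[-1] - request_sent_s
    gaps = len(token_times_s) - 1
    # Mean inter-token latency covers only the decode phase (after the first token).
    itl = (token_times_s[-1] - token_times_s[0]) / gaps if gaps else 0.0
    return LatencyMetrics(ttft, itl, end_to_end)
```

Measuring both deployments with the same function and the same prompts keeps the comparison fair: any difference in the inter-token figure then comes from the draft-model and the speculative token setting rather than from the measurement method.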

AWS also tested the more compact Qwen3-0.6B, but its acceptance rate was approximately 60 percent lower, which was enough to lose most of the benefit. In the range of 5 to 15 speculative tokens, the optimal point in these tests was a configuration with seven tokens, though the authors emphasize that the optimal value strongly depends on prompt structure.
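
A back-of-the-envelope model helps explain why both the acceptance rate and the window size matter. The sketch below assumes each draft token is accepted independently with the same probability, which is a simplification that real prompts do not satisfy; under that assumption, the expected number of tokens produced per target-model pass with a window of k tokens is (1 - a^(k+1)) / (1 - a). The acceptance rates a reader plugs in are hypothetical inputs, not AWS measurements.

```python
def expected_tokens_per_pass(acceptance_rate: float, k: int) -> float:
    """Expected tokens emitted per target pass with k speculative tokens.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate` (an idealized model, not a measured property).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if acceptance_rate == 1.0:
        # Every proposal is accepted, plus one bonus token from the target.
        return float(k + 1)
    return (1.0 - acceptance_rate ** (k + 1)) / (1.0 - acceptance_rate)
```

The formula only captures the upside. Each extra speculative token also adds draft-model work and verification cost, and the expected gain flattens quickly when the acceptance rate is low, which is consistent with a larger window not always being better.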

Ultimately, request structure determined the result. On predictable scenarios — repeated text, numeric sequences, simple code — speculative decoding showed clear benefits. In such cases, the draft-model often guesses what the target-model would output anyway, so the system skips a significant portion of sequential steps.

In tests, inter-token latency dropped to around 15 milliseconds per token, and the end-to-end latency curve consistently stayed below baseline. On open, less deterministic requests, the picture is different: the draft-model more often diverges from the target-model, tokens are rejected, and the potential gain disappears. For such prompts, inter-token latency hovered around 45 milliseconds per token, and speculative and baseline configurations showed almost identical end-to-end latency.

TTFT — time to first token — barely changed because speculative decoding does not accelerate the prefill stage, where the model encodes the input context. The main benefit appears later, in the decode phase, from reducing the number of expensive sequential steps by the target-model. The practical conclusion from the article is simple: speculative decoding on Trainium2 is not a universal acceleration button, but a targeted optimization for a specific workload type.
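
A simple estimate shows why an unchanged TTFT still allows a large end-to-end gain on long outputs: total latency is roughly TTFT plus the inter-token latency for every token after the first. The function below is a rough model under that assumption. In the usage note, the inter-token values echo the figures quoted above, while the TTFT and output length are hypothetical.

```python
def estimate_end_to_end_s(ttft_s: float, inter_token_latency_s: float,
                          output_tokens: int) -> float:
    """Approximate end-to-end latency as TTFT + (N - 1) * inter-token latency."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) * inter_token_latency_s
```

For example, with a hypothetical TTFT of 0.5 s and 1,000 output tokens, comparing `estimate_end_to_end_s(0.5, 0.045, 1000)` with `estimate_end_to_end_s(0.5, 0.015, 1000)` shows that almost all of the difference comes from the decode phase; the shorter the response, the smaller the share of latency that speculative decoding can affect.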

If your product often generates structured and predictable output (code, data extraction, templated reports, configs), this mode can directly reduce output token cost and increase throughput without quality loss. If you primarily serve open-ended chat with free-form generation, the effect may be minimal. Implementing this scheme is therefore worthwhile only after benchmarking on your own prompts and selecting a compatible draft-model and a speculative token window suited to your real scenarios, rather than relying on benchmarks run in isolation from your product.
