Top-p (Nucleus) Sampling

Top-p (nucleus) sampling is a decoding strategy that restricts token selection to the smallest set of tokens whose cumulative probability meets a threshold p, dynamically adapting the candidate pool size to the model's confidence at each generation step.

Nucleus sampling was introduced by Holtzman et al. in "The Curious Case of Neural Text Degeneration" (ICLR 2020). At each generation step, tokens are sorted by descending probability, and the nucleus is defined as the shortest prefix of that sorted list whose cumulative probability is at least p. The probabilities inside the nucleus are then renormalized to sum to 1, and the next token is sampled from that renormalized distribution.
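
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `nucleus` and `sample_top_p` and the five-token distribution are made up for the example, and real inference frameworks operate on batched tensors rather than lists.

```python
import random

def nucleus(probs, p):
    """Return (token indices, renormalized probabilities) for the top-p nucleus."""
    # Sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest prefix whose mass reaches p
            break
    total = sum(probs[i] for i in kept)
    return kept, [probs[i] / total for i in kept]

def sample_top_p(probs, p, rng=random):
    """Draw one token index from the renormalized nucleus."""
    kept, renorm = nucleus(probs, p)
    return rng.choices(kept, weights=renorm, k=1)[0]

# Hypothetical next-token distribution over a five-token vocabulary.
probs = [0.5, 0.3, 0.1, 0.06, 0.04]
kept, renorm = nucleus(probs, p=0.85)
print(kept)  # [0, 1, 2]: cumulative mass is 0.5, then 0.8, then 0.9
print(sample_top_p(probs, p=0.85))  # always one of the kept indices
```

Tokens 3 and 4 can never be drawn here, because the first three tokens already cover at least 85% of the probability mass.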

The key advantage over fixed top-k sampling is adaptivity. When the model is highly confident (for example, after the prompt "The chemical symbol for gold is"), the nucleus may contain only one or two tokens, which keeps generation nearly deterministic and accurate. When the model faces genuine ambiguity, such as the next word in an open-ended story, the nucleus expands to dozens or hundreds of candidates, enabling creative diversity. A fixed top-k value cannot achieve this balance: a small k is too restrictive in uncertain contexts, while a large k admits too many unlikely tokens when the model is confident. The p hyperparameter is commonly set between 0.9 and 0.95 for general-purpose use.
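
To make the contrast with top-k concrete, the sketch below counts how many tokens the nucleus contains for a peaked distribution and for a flat one. Both distributions are invented for illustration, and `nucleus_size` is a hypothetical helper rather than part of any library.

```python
def nucleus_size(probs, p):
    """Number of tokens in the smallest set whose cumulative probability is >= p."""
    cumulative = 0.0
    for n, q in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += q
        if cumulative >= p:
            return n
    return len(probs)

# Confident step: one token holds almost all of the mass.
confident = [0.97] + [0.002] * 15
# Uncertain step: 64 equally likely tokens.
uncertain = [1 / 64] * 64

print(nucleus_size(confident, 0.9))  # 1
print(nucleus_size(uncertain, 0.9))  # 58

# A fixed k=10 keeps 10 tokens in both cases: nine unlikely tokens in the
# confident case, and only part of the mass in the uncertain case.
top_k_mass = sum(sorted(uncertain, reverse=True)[:10])
print(top_k_mass)  # 0.15625
```

With the same p, the candidate pool shrinks to a single token when the model is confident and grows to 58 tokens when it is not, whereas k=10 covers under 16% of the probability mass in the flat case.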

Top-p sampling matters because it empirically reduces both the degenerate repetition that affects greedy decoding and the incoherence that affects unrestricted sampling. It does so by excluding the far tail of the distribution, where low-probability and often nonsensical tokens cluster. It is commonly combined with temperature: temperature reshapes the distribution by scaling the logits first, then top-p sampling selects from the resulting nucleus. Together they provide two complementary levers, overall diversity and tail truncation. They can be set separately but they interact, because changing the temperature changes how many tokens fall inside the nucleus.
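
A minimal sketch of that two-stage pipeline follows, assuming raw logits as input. The names `softmax` and `sample_token`, the logit values, and the settings 0.8 and 0.92 are illustrative assumptions; production frameworks may add further filters or apply them in a different order.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, top_p, rng):
    """Apply temperature first, then sample from the top-p nucleus."""
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices normalizes the weights, which renormalizes the nucleus.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# Hypothetical logits for a five-token vocabulary.
logits = [4.0, 3.0, 1.0, 0.0, -2.0]
rng = random.Random(0)
draws = [sample_token(logits, temperature=0.8, top_p=0.92, rng=rng) for _ in range(10)]
print(draws)  # only indices 0 and 1 appear: together they hold about 98% of the mass
```

Lowering the temperature sharpens the distribution, so the same top_p value yields a smaller nucleus; raising it flattens the distribution and enlarges the nucleus.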

Top-p is a standard parameter in virtually all production language model APIs and inference frameworks as of 2026, including OpenAI, Anthropic's Claude API, Google Gemini, vLLM, and Hugging Face Transformers. Research has explored alternatives such as min-p sampling (removing tokens whose probability falls below a fraction of the top token's probability) and top-a sampling, each offering slightly different tail-truncation behaviors. Despite these variants, top-p remains the dominant approach due to its simplicity and well-understood empirical behavior across model families.
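
For comparison, the min-p rule described above can be sketched as follows. This follows only the one-line description in this entry (keep tokens whose probability is at least a fraction of the top token's probability); the function name `min_p_filter` and the example values are assumptions, and library implementations may differ in detail.

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    kept = [i for i, q in enumerate(probs) if q >= threshold]
    total = sum(probs[i] for i in kept)
    return kept, [probs[i] / total for i in kept]

# Peaked distribution: the threshold is 0.15 * 0.5 = 0.075.
peaked = [0.5, 0.3, 0.1, 0.06, 0.04]
kept_peaked, _ = min_p_filter(peaked, min_p=0.15)
print(kept_peaked)  # [0, 1, 2]

# Flat distribution: every token ties with the top token, so all are kept.
flat = [0.2] * 5
kept_flat, _ = min_p_filter(flat, min_p=0.15)
print(kept_flat)  # [0, 1, 2, 3, 4]
```

Unlike top-p, the cutoff here scales with the model's confidence in its best token rather than with cumulative mass.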

Example

With p=0.92 and temperature=0.8, a language model generating a mystery novel keeps its next-word candidates focused on plausible plot continuations (often a few dozen tokens, depending on the context) while still permitting unexpected but coherent choices that greedy decoding would never produce.
