Inference

Temperature

Temperature is a scalar hyperparameter that divides a language model's logits before the softmax step, controlling output randomness: values below 1.0 sharpen the distribution toward high-probability tokens; values above 1.0 flatten it, increasing diversity.

Temperature is a control parameter used during token sampling in language models. It is applied by dividing all logit scores by the temperature value T before the softmax function converts them into a probability distribution over the vocabulary. When T = 1.0, the model samples according to its learned distribution without modification. When T < 1.0, the distribution becomes sharper, concentrating probability mass on the most likely tokens. When T > 1.0, the distribution flattens, giving lower-probability tokens a greater chance of being selected.
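Written out per token, the scaling described above takes the following form. The symbols p_i (probability of token i) and V (vocabulary size) are introduced here for illustration only.

```latex
p_i(T) = \frac{\exp(z_i / T)}{\sum_{j=1}^{V} \exp(z_j / T)}, \qquad i = 1, \dots, V
\qquad\text{and}\qquad
\frac{p_i(T)}{p_j(T)} = \exp\!\left(\frac{z_i - z_j}{T}\right)
```

The ratio on the right shows why the effect is monotonic: the gap between any two logits is multiplied by 1/T, so T < 1.0 widens the odds in favor of the higher-scoring token and T > 1.0 narrows them.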

The mathematical effect is direct: given a logit vector z, the temperature-scaled distribution is softmax(z / T). As T approaches 0, the distribution collapses to a one-hot vector at the argmax, which is equivalent to greedy decoding: the single most probable token is always selected. T = 0 itself is undefined in the formula (division by zero), so implementations typically treat it as a special case meaning greedy decoding. As T increases toward infinity, the distribution converges to uniform across the entire vocabulary. In practice, temperatures between 0.0 and 2.0 cover nearly all useful behavior; values above roughly 1.5 tend to produce incoherent output for many current model families.
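The limiting behavior can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the three-token logit vector are illustrative assumptions, not part of any particular inference framework.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        # T = 0 would divide by zero; this sketch treats it as the
        # greedy limit and returns a one-hot vector at the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Toy three-token vocabulary (hypothetical logits).
logits = [2.0, 1.0, 0.0]
for t in (0.0, 0.5, 1.0, 2.0, 100.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

For these logits the top token receives roughly 0.87 of the probability mass at T = 0.5, 0.67 at T = 1.0, and 0.51 at T = 2.0, and the distribution is close to uniform at T = 100.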

Temperature matters because the same underlying model can serve qualitatively different use cases through this single parameter. Code generation and factual question-answering benefit from low temperatures (0.0–0.3) to maximize accuracy and reproducibility. Creative writing, brainstorming, and open-ended dialogue benefit from higher temperatures (0.7–1.2) to produce varied and surprising outputs. Setting temperature too high introduces incoherence; setting it too low produces repetitive, overly conservative text that fails to reflect the model's full range of knowledge.

Most major language model APIs (OpenAI, Anthropic, Google, Mistral, Meta) expose temperature as a first-class parameter, though some models restrict or fix its value. Research published in 2024–2025 examined the interaction between temperature and chain-of-thought reasoning, finding that multi-step logical tasks benefit from very low temperature to maintain consistency, while ensemble-based methods such as self-consistency deliberately sample multiple higher-temperature completions and aggregate them, typically by majority vote over the final answers. Some inference frameworks also implement temperature annealing within a single generation, gradually reducing temperature as output progresses toward a conclusion.
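Both ideas can be sketched in a few lines. This is an illustration under stated assumptions rather than any framework's actual implementation: `sample_answer` stands in for a real model call, `toy_sampler` and its error rate are invented for the demo, and the linear schedule is only one possible annealing shape.

```python
import random
from collections import Counter


def self_consistency(sample_answer, n_samples, temperature):
    """Sample several answers at one temperature; return the majority answer
    and the fraction of samples that agreed with it."""
    answers = [sample_answer(temperature) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples


def annealed_temperature(step, total_steps, start=1.0, end=0.2):
    """Linearly interpolate the temperature from `start` to `end`
    across the steps of a single generation."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


# Hypothetical stand-in for a model: it answers "42" most of the time and
# errs more often as temperature rises (a made-up relation for the demo).
_rng = random.Random(0)


def toy_sampler(temperature):
    p_wrong = min(0.4, 0.3 * temperature)
    if _rng.random() < p_wrong:
        return _rng.choice(["41", "43"])
    return "42"


answer, agreement = self_consistency(toy_sampler, n_samples=25, temperature=0.8)
print(answer, agreement)
print([round(annealed_temperature(s, 5), 2) for s in range(5)])
```

Majority voting tolerates individual wrong samples as long as the correct answer remains the most frequent one, which is why these methods can afford a temperature that would be risky for a single completion.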

Example

A customer-support chatbot is deployed with temperature=0.1 to produce accurate, predictable answers to policy questions, while the same base model powering a creative writing tool runs at temperature=1.1 to generate varied and inventive story continuations.
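One way such a deployment might be expressed is as per-application sampling presets over a shared model. The application names, the model identifier, and the request shape below are hypothetical and do not correspond to any specific vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    temperature: float


# Hypothetical application names mapped to the temperatures in the example.
PRESETS = {
    "support_bot": SamplingConfig(temperature=0.1),
    "story_writer": SamplingConfig(temperature=1.1),
}


def build_request(app, prompt, model="base-model"):
    """Assemble a generic request payload for the given application."""
    cfg = PRESETS[app]
    return {"model": model, "prompt": prompt, "temperature": cfg.temperature}


print(build_request("support_bot", "What is the refund policy?"))
print(build_request("story_writer", "Continue the story: the door creaked open"))
```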
