Inference

Test-Time Compute

Test-time compute (TTC) is the computational budget—processing cycles, memory, and time—that a model expends during inference rather than training, allowing it to spend more effort on harder problems without any change to its weights.

More precisely, TTC is the computation a model performs at inference time, that is, while generating an output for a given input, as distinct from the compute spent during training. Training cost is paid up front and amortized over every later query, whereas TTC can be scaled per query: easy requests receive minimal compute, while difficult ones receive substantially more, so resources are allocated adaptively without retraining.
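The per-query allocation described above can be sketched as a simple lookup from an estimated difficulty to a compute budget. This is a minimal illustration, not any provider's actual policy: the `Budget` type, the `TIERS` thresholds, and the assumption that a difficulty score in [0, 1] is available are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Hypothetical per-query compute budget."""
    max_reasoning_tokens: int
    num_samples: int


# Illustrative tiers only (assumed values); real systems would tune these.
# Each entry is (upper bound on difficulty, budget for that tier).
TIERS = [
    (0.3, Budget(max_reasoning_tokens=0, num_samples=1)),     # easy
    (0.7, Budget(max_reasoning_tokens=1024, num_samples=1)),  # medium
    (1.0, Budget(max_reasoning_tokens=8192, num_samples=4)),  # hard
]


def allocate_budget(difficulty: float) -> Budget:
    """Map a difficulty estimate in [0, 1] to a budget tier."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    for upper_bound, budget in TIERS:
        if difficulty <= upper_bound:
            return budget
    return TIERS[-1][1]
```

In practice the difficulty estimate is the hard part; it might come from a small classifier, from the model's own uncertainty, or from the caller choosing a tier explicitly.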

Models use additional test-time compute through several mechanisms. Best-of-N sampling generates multiple candidate responses and selects the highest-scoring one, typically using a reward model or verifier. Iterative self-refinement loops let a model critique and revise its own draft. Most prominently, extended chain-of-thought reasoning produces long internal reasoning traces, sometimes thousands of tokens, before emitting a final answer. OpenAI's o1 (first released as a preview in September 2024) and o3 are the most widely cited examples of models explicitly trained with reinforcement learning to make productive use of additional TTC.
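As a rough sketch of the first mechanism, best-of-N sampling reduces to a short loop: draw N candidates, score each, and keep the best. The `generate` and `score` callables below are toy stand-ins (assumptions for illustration) for a sampled language model and a reward model; no real model API is implied.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    score: Callable[[str, str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins (hypothetical): a noisy "model" and a scorer that knows
# the correct answer to the prompt is 42.
def toy_generate(prompt: str, rng: random.Random) -> str:
    return str(42 + rng.choice([-2, -1, 0, 0, 1, 2]))


def toy_score(prompt: str, candidate: str) -> float:
    return -abs(int(candidate) - 42)
```

Calling `best_of_n("17 + 25", toy_generate, toy_score, n=16)` spends sixteen times the generation compute of a single sample. The selected answer can only be as good as the scorer's ability to recognize it, which is why reward-model quality matters as much as N.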

The central insight is that performance on hard reasoning tasks (competition mathematics, complex code generation, multi-step planning) tends to improve steadily as test-time compute increases, typically with diminishing returns, much as model quality improves with training compute. This shifts a key design lever from the training phase (expensive, infrequent) to the inference phase (on-demand, priceable per query), and enables providers to offer tiered quality at tiered cost.
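A back-of-the-envelope calculation shows why extra samples help and why the gains flatten. It rests on two idealized assumptions that real systems only approximate: samples are independent, and a perfect verifier always recognizes a correct answer. The single-sample success rate `p` is a hypothetical input, not a measured value.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct,
    given that each sample is correct with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n
```

For example, `coverage(0.5, 2)` is 0.75: doubling the compute lifts the success rate from 50% to 75%, and each further doubling adds less than the one before it.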

By 2026, TTC scaling has become a mainstream design axis across frontier labs. Google's Gemini 2.0 Flash Thinking, DeepSeek-R1, and Anthropic's Claude 3.7 Sonnet with extended thinking all generate explicit reasoning traces before answering, and some, such as Claude 3.7 Sonnet, let callers set a reasoning-token budget. Research focuses on efficient search strategies, such as Monte Carlo Tree Search applied to token generation and process reward models that score intermediate steps, to maximize output quality per unit of compute spent.
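To make the idea of scoring intermediate steps concrete, the sketch below runs a step-level beam search guided by a process reward model (PRM). It uses beam search rather than Monte Carlo Tree Search, as the simpler of the two. The `propose`, `prm_score`, and `is_complete` callables are hypothetical stand-ins for a step generator, a trained PRM, and a termination check.

```python
from typing import Callable, List, Sequence, Tuple

Trace = Tuple[str, ...]  # a partial reasoning trace, one string per step


def prm_beam_search(
    propose: Callable[[Trace], Sequence[str]],
    prm_score: Callable[[Trace], float],
    is_complete: Callable[[Trace], bool],
    beam_width: int,
    max_steps: int,
) -> Trace:
    """Keep the beam_width best partial traces at each step, ranked by the PRM."""
    if beam_width < 1:
        raise ValueError("beam_width must be at least 1")
    beam: List[Trace] = [()]
    for _ in range(max_steps):
        expanded: List[Trace] = []
        for trace in beam:
            if is_complete(trace):
                expanded.append(trace)  # carry finished traces forward unchanged
            else:
                expanded.extend(trace + (step,) for step in propose(trace))
        if not expanded:
            break
        expanded.sort(key=prm_score, reverse=True)
        beam = expanded[:beam_width]
        if all(is_complete(t) for t in beam):
            break
    return max(beam, key=prm_score)


# Toy problem (hypothetical): reach a running total of exactly 10.
def toy_total(trace: Trace) -> int:
    return sum(int(step) for step in trace)


def toy_propose(trace: Trace) -> Sequence[str]:
    return ["+1", "+3", "+4"]


def toy_prm(trace: Trace) -> float:
    return -abs(10 - toy_total(trace))  # closer to the target scores higher


def toy_done(trace: Trace) -> bool:
    return toy_total(trace) == 10
```

Here `beam_width` and `max_steps` are the compute knobs: widening the beam or allowing more steps spends more inference compute in exchange for a broader search.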

Example

When asked to prove a non-trivial combinatorics identity, a model configured with a high TTC budget might generate over 800 intermediate reasoning tokens, exploring multiple proof strategies and checking each for contradictions, before committing to a final answer. A direct single-pass response skips this exploration, which is why extended reasoning tends to score higher on competition-level benchmarks.
