Scaling Laws

Scaling laws are empirical power-law relationships showing that language model performance improves predictably as model parameters, training data volume, and compute budget increase, enabling researchers to forecast capability gains before committing to expensive training runs.

Scaling laws are empirical relationships, broadly of the form L ∝ N^(-α) for loss L and parameter count N, that describe how the performance of machine learning models changes as a function of parameter count, training data volume, and total compute budget. They have been studied most extensively for large language models, where they allow researchers to extrapolate expected model quality from small-scale experiments to large ones without paying for the full training run first.
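Written out, the proportionality above becomes a straight line in log-log space, which is why such curves are usually fitted by linear regression on logarithms. The constant N_c below is an assumed fitted scale parameter introduced for illustration; it does not appear in the text above.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha}
\quad\Longleftrightarrow\quad
\log L = -\alpha \log N + \alpha \log N_c ,
\qquad
\frac{L(2N)}{L(N)} = 2^{-\alpha}
```

The last ratio shows that, under this form, each doubling of N multiplies the loss by the same factor regardless of the starting size.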

The foundational work was published by Kaplan et al. at OpenAI in 2020. It showed that cross-entropy loss for autoregressive language models declines as a smooth power law along each of the three scaling axes (parameters, data, and compute), largely independently of specific architecture details. In 2022, Hoffmann et al. at DeepMind published the Chinchilla paper, which refined the compute-optimal training frontier: prior large models had been systematically undertrained on data relative to their parameter count. Chinchilla's 70-billion-parameter model, trained on 1.4 trillion tokens (about 20 tokens per parameter), matched or exceeded GPT-3 (175 billion parameters) on many benchmarks, supporting the conclusion that model size and data quantity should scale in roughly equal proportion for compute efficiency.
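The equal-proportion rule can be turned into a rough budgeting calculation. The sketch below rests on two assumptions that are not stated in the text above: the widely used approximation that training compute is about 6 FLOPs per parameter per token, and a fixed ratio of 20 tokens per parameter, which is simply 1.4 trillion divided by 70 billion. The function name and its parameters are hypothetical.

```python
import math

def compute_optimal_allocation(compute_flops, tokens_per_param=20.0,
                               flops_per_param_token=6.0):
    """Split a FLOP budget C into (parameters N, tokens D).

    Assumes C ~= k * N * D with k = 6 (a common approximation) and
    D = r * N with r = 20 (ratio implied by 1.4e12 tokens / 70e9 params).
    """
    # C = k * N * D = k * r * N**2  =>  N = sqrt(C / (k * r))
    n_params = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget implied by the Chinchilla figures quoted above, under the 6*N*D assumption.
budget = 6.0 * 70e9 * 1.4e12
n, d = compute_optimal_allocation(budget)
print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")

# Quadrupling the budget doubles both parameters and tokens.
n4, d4 = compute_optimal_allocation(4 * budget)
print(f"4x budget: params x{n4 / n:.2f}, tokens x{d4 / d:.2f}")
```

Because both quantities grow with the square root of compute under these assumptions, a fourfold budget increase doubles the model and doubles the data. Real compute-optimal recipes fit the ratio empirically, so the constants here should be treated as placeholders.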

Scaling laws matter because they convert abstract intuitions into concrete engineering decisions. Before committing thousands of GPU-days to a training run, teams run small-scale ablations and use scaling-law extrapolations to predict the performance of larger models, enabling rational allocation of compute budgets. The Chinchilla result in particular shifted industry norms: subsequent open-weight models, including Llama 2 and the Mistral series, were trained significantly longer on more data than their predecessors at equivalent parameter counts.

By 2026, scaling laws have been extended beyond pure text to multimodal models, code generation, and post-training stages such as reinforcement learning from human feedback. Researchers actively debate whether these laws will plateau as high-quality text from the public internet is exhausted, or whether they will continue to hold when training incorporates synthetic data and reasoning traces. Companies including Google DeepMind, Meta AI, and Anthropic treat scaling-law analysis as a core planning discipline, publishing updated compute-optimal recipes alongside new model releases.

Example

Before committing to training a 70-billion-parameter model, a research team runs five small-scale experiments across a range of sizes, fits a power-law curve to the results, and predicts that doubling compute will reduce validation loss by roughly 8%, informing the decision of whether the investment is justified.
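The workflow in this example can be sketched in a few lines. The five data points below are synthetic: they are generated from an assumed power law with exponent 0.12, chosen so that the result lands near the 8% figure above, and they are not measurements from any real model. The helper `fit_power_law` is a hypothetical name.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha) as a line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, alpha)

# Synthetic results for five small runs: training compute (FLOPs) and validation loss.
compute = [1e17, 1e18, 1e19, 1e20, 1e21]
loss = [400.0 * c ** (-0.12) for c in compute]  # assumed, noise-free power law

a, alpha = fit_power_law(compute, loss)

def predict_loss(c):
    """Extrapolate the fitted curve to a compute budget c."""
    return a * c ** (-alpha)

# Fractional loss reduction from doubling compute: 1 - 2**(-alpha).
reduction = 1.0 - 2.0 ** (-alpha)
print(f"alpha = {alpha:.3f}, loss reduction per doubling = {reduction:.1%}")
```

With these synthetic inputs the fitted exponent is 0.12 and the predicted reduction per doubling is about 8%. Real measurements are noisy and often need an additional irreducible-loss term, so a plain log-log line fit like this one should be read as a first approximation.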

Related terms

Pre-training, Frontier Model
