Training Data

Training data is the labeled or unlabeled dataset on which a machine learning model is optimized: the model adjusts its internal parameters to minimize prediction error on these examples. Its quality, scale, and diversity are primary determinants of model capability.

Training data is the collection of examples, labeled or unlabeled, used to optimize the parameters of a machine learning model through repeated exposure and gradient-based error correction. During training, the model processes these examples, generates predictions, and computes a loss that measures prediction error. Backpropagation then computes the gradient of that loss with respect to each weight, and an optimizer such as stochastic gradient descent uses those gradients to adjust the weights so that predictions improve over successive passes through the data.
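The loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not how production systems are written: the one-feature linear model, the four data points, the learning rate, and the epoch count are all assumptions chosen so that the gradients can be written out by hand instead of being computed by backpropagation.

```python
# Toy training loop: fit y = w*x + b by gradient descent on mean squared error.
# The model, data, learning rate, and epoch count are illustrative assumptions.

def train_linear(examples, lr=0.05, epochs=500):
    w, b = 0.0, 0.0
    n = len(examples)
    for _ in range(epochs):              # one epoch = one full pass over the data
        grad_w = grad_b = 0.0
        for x, y in examples:
            pred = w * x + b             # forward pass: generate a prediction
            err = pred - y               # prediction error for this example
            grad_w += 2 * err * x / n    # contribution to d(MSE)/dw
            grad_b += 2 * err / n        # contribution to d(MSE)/db
        w -= lr * grad_w                 # step against the gradient
        b -= lr * grad_b
    return w, b

def mse(examples, w, b):
    """Mean squared error of the model (w, b) on the examples."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)

examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated by y = 2x + 1
w, b = train_linear(examples)
final_loss = mse(examples, w, b)
```

With these settings the learned w and b approach 2 and 1, the values that generated the data, and the loss shrinks toward zero. Deep learning frameworks automate the gradient computation, but the predict, measure, and update cycle is the same.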

The composition and preparation of training data profoundly shape what a model learns. For supervised tasks such as image classification, each example pairs an input with a target label. For language models, training data consists of vast text corpora processed without per-example labels; the model learns by predicting the next token given prior context, a self-supervised objective that scales to web-sized corpora because the text itself supplies the targets. Curation and preprocessing steps such as deduplication, removal of low-quality or harmful content, and tokenization substantially affect downstream behavior. Landmark datasets include ImageNet (more than 14 million labeled images in total; its ILSVRC subset of roughly 1.3 million training images has been foundational for computer vision since 2012), Common Crawl (petabytes of web crawl data, filtered subsets of which are used in most major language models), and The Pile (an 800 GB curated text corpus assembled by EleutherAI in 2021).
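To make the self-supervised objective concrete, the sketch below turns raw text into (context, next token) training pairs with no human labeling. The function name `next_token_pairs`, the whitespace tokenizer, and the three-token context window are simplifying assumptions for illustration; real language models use learned subword tokenizers and much longer contexts.

```python
def next_token_pairs(text, context_size=3):
    """Build (context, target) pairs for next-token prediction from raw text."""
    tokens = text.split()  # whitespace tokenization is a simplifying assumption
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]  # up to context_size prior tokens
        pairs.append((context, tokens[i]))            # the target is simply the next token
    return pairs

pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

Every position in the text yields a training example, which is why unlabeled corpora can supply so much supervision.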

Because a model can only learn patterns that are present in its data, the quantity and quality of training data are primary determinants of model capability. Errors, biases, and gaps propagate directly into model behavior: a language model trained predominantly on English web text typically underperforms in low-resource languages, and a facial recognition system trained on demographically skewed images can exhibit unequal error rates across groups. The phrase "data is the new oil" reflects how competitive advantage in AI has shifted toward data acquisition, curation, and licensing.
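One common way to surface unequal error rates is to break evaluation results down by group instead of reporting a single aggregate number. The following is a minimal sketch; the function name, the group names, and the records are hypothetical and invented purely for illustration, not drawn from any real system.

```python
from collections import defaultdict

def error_rate_by_group(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        if y_true != y_pred:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

# Hypothetical evaluation records: (group, true label, predicted label).
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 1),
    ("group_b", 1, 0), ("group_b", 0, 0),
]
rates = error_rate_by_group(records)
print(rates)
```

In this made-up sample the overall error rate is 2 in 6, which hides the fact that the less-represented group is misclassified twice as often as the other.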

As of 2025–2026, training frontier language models involves datasets measured in trillions of tokens. Models such as Meta's Llama, Google's Gemini series, and Anthropic's Claude are trained on corpora of this scale that blend web data, books, code, scientific papers, and curated synthetic material, although not every developer publishes exact figures. Concern that high-quality human-written text on the public internet is being exhausted is driving investment in synthetic data generation and more aggressive quality-filtering pipelines to sustain scaling.

Example

Meta's Llama 3 was trained on approximately 15 trillion tokens drawn from filtered web text, code repositories, and multilingual content, with multiple deduplication passes and quality-filtering stages applied before the pre-training run began.
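As a rough illustration of what a deduplication pass and a quality filter do, the sketch below drops very short documents and exact duplicates after normalizing case and whitespace. It is not Meta's pipeline: the function names, the `min_words` threshold, and the sample documents are assumptions, and production systems typically add near-duplicate detection and model-based quality classifiers on top of simple rules like these.

```python
import hashlib

def normalize(doc):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(doc.lower().split())

def dedup_and_filter(docs, min_words=5):
    """Keep the first copy of each document that passes a minimum-length check."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:   # crude quality filter (assumed threshold)
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:                  # exact duplicate after normalization
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "Training data shapes what a model learns.",
    "training   data shapes what a MODEL learns.",   # duplicate after normalization
    "Click here!",                                    # too short to keep
    "Deduplication removes repeated documents from a corpus.",
]
cleaned = dedup_and_filter(docs)
```

Hashing the normalized text keeps memory use proportional to the number of unique documents instead of their total size, which matters when a corpus runs to trillions of tokens.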
