Pre-training

Pre-training is the initial large-scale training phase in which a neural network learns general representations from a massive corpus using self-supervised objectives, before any task-specific fine-tuning.

Pre-training is also the most computationally expensive phase of modern large-scale model development. The network is trained on a massive, broadly diverse dataset (for language models, often hundreds of billions to trillions of tokens) to acquire general-purpose representations of language, factual knowledge, and reasoning patterns.

For large language models, the dominant pre-training objective is autoregressive next-token prediction: given a sequence of tokens, the model learns to predict the next token by minimizing cross-entropy loss across billions of examples. Encoder-only models such as BERT use masked language modeling instead, predicting randomly masked tokens from the surrounding context. In both cases, no manually curated labels are required because the supervision signal is derived directly from the raw data (self-supervised learning). Large pre-training runs are typically distributed across thousands of GPUs or TPUs and last weeks or months, with compute requirements ranging from tens of thousands to millions of GPU-hours.
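The two objectives above can be illustrated with a minimal sketch in plain Python. It shows how training targets fall out of the raw token sequence with no labels, and how the cross-entropy loss is computed for one sequence. The token ids, `MASK_ID`, `uniform_model`, and the fixed mask positions are hypothetical stand-ins chosen for illustration; real masked-language-model training picks positions at random and uses a neural network rather than a constant scorer.

```python
import math

MASK_ID = 0  # hypothetical id reserved for a [MASK] token


def next_token_pairs(token_ids):
    """Autoregressive targets: each prefix is paired with the token that follows it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


def mask_tokens(token_ids, positions):
    """Masked-LM targets: hide the chosen positions and record the hidden tokens."""
    masked = [MASK_ID if i in positions else t for i, t in enumerate(token_ids)]
    targets = {i: token_ids[i] for i in positions}
    return masked, targets


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def uniform_model(context, vocab_size):
    """Hypothetical untrained model: the same score for every token."""
    return [0.0] * vocab_size


def mean_next_token_loss(token_ids, model, vocab_size):
    """Average cross-entropy over every next-token prediction in the sequence."""
    pairs = next_token_pairs(token_ids)
    losses = [cross_entropy(model(ctx, vocab_size), tgt) for ctx, tgt in pairs]
    return sum(losses) / len(losses)


VOCAB_SIZE = 10
tokens = [5, 2, 7, 2, 9]

pairs = next_token_pairs(tokens)                        # 4 (context, target) pairs
masked, targets = mask_tokens(tokens, positions={1, 3})
loss = mean_next_token_loss(tokens, uniform_model, VOCAB_SIZE)
```

With uniform scores the loss equals the natural log of the vocabulary size, which is the baseline an untrained model starts from. Training lowers it by shifting probability toward the tokens that actually follow each context.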

Pre-training is the source of the broad world knowledge, linguistic competence, and reasoning capacity that make large models useful across many tasks without retraining from scratch. A pre-trained model can subsequently be adapted to specific tasks or behaviors through fine-tuning, including instruction tuning and reinforcement learning from human feedback (RLHF), at a fraction of the original training cost. This transfer learning paradigm has become the dominant approach in natural language processing, computer vision, and multimodal AI.

Contemporary pre-training runs use datasets assembled from web crawls (such as Common Crawl), books, code repositories, scientific papers, and multilingual sources, with the largest runs often totaling 10–30 trillion tokens. Multimodal pre-training, which combines text with images, audio, and video, has become increasingly common, with models such as GPT-4o and Gemini 1.5 learning joint representations across modalities. Efficient training techniques such as FlashAttention, tensor and pipeline parallelism, and mixed-precision arithmetic allow training runs to complete within practical time and energy budgets.
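As a rough illustration of the mixed-precision arithmetic mentioned above, the sketch below shows why such schemes commonly keep a higher-precision master copy of the weights. Half precision cannot resolve very small updates, so they vanish when applied directly to a 16-bit weight. The update size and step count are hypothetical, and a Python float (64-bit) stands in for the master copy, which would typically be 32-bit in practice.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


UPDATE = 1e-4  # hypothetical per-step weight update
STEPS = 1000

fp16_weight = to_fp16(1.0)  # weight stored and updated only in half precision
master_weight = 1.0         # higher-precision master copy of the same weight

for _ in range(STEPS):
    # Near 1.0, adjacent fp16 values are about 0.001 apart, so adding 1e-4
    # rounds straight back to the old value and the update is lost.
    fp16_weight = to_fp16(fp16_weight + UPDATE)
    # The master copy has enough precision to accumulate every small step.
    master_weight += UPDATE

# Cast the master copy down to half precision for the next forward pass.
working_copy = to_fp16(master_weight)
```

Here `fp16_weight` never moves from 1.0, while `master_weight` ends near 1.1 and its half-precision cast reflects the accumulated change. Production setups add further measures, such as loss scaling or the bfloat16 format, that this sketch leaves out.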

Example

Meta pre-trained the 70-billion-parameter variant of Llama 3 on approximately 15 trillion tokens of multilingual text and code, using thousands of NVIDIA H100 GPUs over several months. The resulting checkpoint was then released publicly for others to fine-tune for specific applications such as code generation or document summarization.
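To give a sense of scale, the figures in this example can be plugged into a common rule of thumb that estimates training compute as roughly 6 floating-point operations per parameter per token. The sketch below is a back-of-the-envelope calculation, not a reported measurement. The sustained throughput per GPU and the cluster size are assumptions chosen for illustration.

```python
def training_flops(params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens


def gpu_hours(total_flops, sustained_flops_per_gpu):
    """GPU-hours needed at a given sustained per-GPU throughput (FLOP/s)."""
    return total_flops / sustained_flops_per_gpu / 3600


PARAMS = 70e9    # 70 billion parameters (from the example above)
TOKENS = 15e12   # about 15 trillion training tokens (from the example above)

# Assumption: each GPU sustains 4e14 FLOP/s (400 TFLOP/s) end to end.
# Real utilization varies widely with hardware, software, and model size.
SUSTAINED_FLOPS_PER_GPU = 4e14

# Assumption: a hypothetical cluster size, standing in for "thousands" of GPUs.
NUM_GPUS = 4096

flops = training_flops(PARAMS, TOKENS)            # about 6.3e24 FLOPs
hours = gpu_hours(flops, SUSTAINED_FLOPS_PER_GPU)  # about 4.4 million GPU-hours
wall_clock_days = hours / NUM_GPUS / 24            # depends on both assumptions
```

Under these assumptions the estimate lands in the millions of GPU-hours, and wall-clock time shrinks in proportion to the number of GPUs working in parallel.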
