Training

Self-Supervised Learning

Self-supervised learning is a training paradigm in which a model generates its own supervisory signal from unlabeled data by solving pretext tasks, removing the need for costly human annotation during pre-training.

Self-supervised learning (SSL) is a machine learning approach in which a model is trained to predict parts of its own input—or relationships between inputs—without requiring human-provided labels. The supervisory signal is derived automatically from the structure of the data itself, making it possible to exploit vast quantities of raw, unannotated data.
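As a minimal sketch of how a supervisory signal can come from the data itself, the snippet below turns a raw token sequence into (context, target) pairs for next-token prediction. The function name `next_token_pairs` and the `context_size` parameter are illustrative assumptions, not part of any particular library.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) training pairs from a raw token sequence.

    No human labels are involved: each target is simply the token that
    follows its context in the original text.
    """
    pairs = []
    for i in range(1, len(tokens)):
        # Keep at most `context_size` tokens immediately before position i.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Every position after the first yields one training example, so a sequence of n tokens produces n - 1 pairs with no annotation effort.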

In practice, SSL works by defining a pretext task: a proxy objective whose answer can be derived from the raw data. Common variants include masked language modeling (hiding tokens and predicting them, as in BERT), next-token prediction (as in GPT-series models), and contrastive learning (pulling representations of augmented views of the same sample together while pushing apart representations of different samples, as in SimCLR). CLIP applies a similar contrastive objective to paired images and captions rather than to augmented views. In vision, Masked Autoencoders (MAE) carry the masking strategy over to image patches, while DINO instead trains a student network to match a teacher's output across augmented views of the same image (self-distillation).
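The contrastive idea can be sketched with a simplified, one-directional InfoNCE loss in plain Python. This is an illustration under stated assumptions, not SimCLR's exact NT-Xent loss (which is symmetric and draws negatives from both views); the names `info_nce`, `view_a`, and `view_b`, the toy 2-D embeddings, and the temperature value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] and positives[i] are embeddings of two views of sample i;
    every positives[j] with j != i acts as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching view as the correct class.
        total += log_sum - logits[i]
    return total / len(anchors)


# Toy embeddings: view_b[i] is a slightly perturbed copy of view_a[i].
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce(view_a, view_b)         # matching views are paired
shuffled = info_nce(view_a, view_b[::-1])  # views deliberately mismatched
print(aligned < shuffled)
```

The loss is low when each anchor is closest to its own second view and high when the pairing is scrambled, which is the behavior the "pull together, push apart" description refers to.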

The approach is central to modern AI because it enables training on internet-scale corpora of text, images, audio, and video, far more data than can be manually annotated. Representations learned through SSL transfer to many downstream tasks with minimal additional supervision, which is why SSL is the basis for most large foundation models.

As of 2026, self-supervised learning underpins virtually every major large language model and vision-language model. OpenAI's GPT-4, Meta's Llama series, Google DeepMind's Gemini, and Anthropic's Claude all rely on next-token prediction as their primary pre-training objective, with further fine-tuning applied afterwards. SSL-based audio models such as Meta's wav2vec 2.0 and HuBERT have similarly become standard for speech representation learning, pre-training on thousands of hours of unlabeled audio.

Example

BERT, released by Google in 2018, was pre-trained on approximately 3.3 billion words using masked language modeling: 15% of input tokens were randomly selected for masking, and the model learned to predict them. It was then fine-tuned on labeled datasets for tasks such as question answering and text classification, achieving state-of-the-art results at the time with far fewer labeled examples than previous task-specific models required.
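The masking step can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: it works on whitespace-split words instead of WordPiece tokens, and it always substitutes `[MASK]`, whereas the original recipe replaces a selected token with `[MASK]` only 80% of the time (using a random token 10% of the time and leaving it unchanged 10% of the time). The function name `mask_tokens` and the `seed` parameter are assumptions made for this example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and return the prediction targets.

    Returns (masked, labels), where `masked` is a copy of `tokens` with the
    selected positions replaced by MASK and `labels` maps each masked
    position to the original token the model should predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # target comes from the data itself
        masked[pos] = MASK
    return masked, labels


sentence = ("the model hides a few words in every sentence and then learns "
            "to predict them from the surrounding context alone")
tokens = sentence.split()  # 20 tokens, so 15% selects 3 of them
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

The training loss would be computed only at the positions stored in `labels`; the unmasked tokens serve purely as context.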
