Training

Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) is a training algorithm that fine-tunes language models to align with human preferences by reformulating the RLHF objective as a binary classification loss over preference pairs, eliminating the need for a separately trained reward model.

DPO was introduced in a 2023 paper by Rafailov, Sharma, Mitchell, and colleagues at Stanford. It pursues the same alignment goal as Reinforcement Learning from Human Feedback (RLHF), but re-expresses the RLHF optimization problem as a supervised classification task over pairs of human-preferred and human-dispreferred model outputs.

The core insight is a mathematical reparameterization: under the KL-constrained RLHF objective, the reward can be written in closed form as a scaled log-ratio between the optimal policy's and the reference model's probabilities. Training the policy directly to prefer chosen responses over rejected ones with a binary cross-entropy loss therefore optimizes an implicit reward without ever making it explicit. Given a dataset of (prompt, chosen response, rejected response) triples, the model is updated to increase the likelihood of the chosen completion relative to the rejected one, both measured against the reference policy, with each example weighted by how strongly the implicit reward currently misranks the pair.
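The per-example loss described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the argument names, and the default `beta=0.1` are assumptions made for this example, and the inputs are assumed to be summed token log-probabilities of each full response under the policy and the frozen reference model.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; beta scales the implicit reward
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple."""
    # Implicit reward: beta times the log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Binary cross-entropy on the reward margin, with the chosen
    # response treated as the positive class.
    return -log_sigmoid(chosen_reward - rejected_reward)


# Hypothetical log-probabilities for illustration only.
loss_at_reference = dpo_loss(-12.0, -15.0, -12.0, -15.0)
loss_after_shift = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(loss_at_reference, loss_after_shift)
```

When the policy equals the reference model, the margin is zero and the loss is log 2. Shifting probability mass toward the chosen response and away from the rejected one lowers it. In practice the same computation would be run over batches of tensors in a deep learning framework, with gradients flowing only through the policy's log-probabilities.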

DPO's main advantages over PPO-based RLHF are stability and simplicity. Standard RLHF with PPO requires training a reward model and querying it throughout training, sampling completions from the policy as it learns, tuning many interacting hyperparameters, and guarding against reward hacking. DPO eliminates the separate reward model and trains on a fixed preference dataset with only the policy and a frozen reference model, reducing compute requirements and the number of hyperparameters. In controlled comparisons, DPO-trained models often match or exceed PPO-trained models on instruction-following and preference benchmarks.

As of 2026, DPO and its derivatives, including Identity Preference Optimization (IPO), Kahneman-Tversky Optimization (KTO), and Simple Preference Optimization (SimPO), are standard components of alignment pipelines at most major AI labs. However, for tasks requiring complex multi-step reasoning, reinforcement learning with verifiable rewards (RLVR) and policy-gradient algorithms such as Group Relative Policy Optimization (GRPO) are increasingly preferred, because DPO can underperform when the preference signal is sparse or the correct reasoning path is ambiguous.

Example

Meta's Llama 3 instruction-tuned variants use DPO as part of their post-training alignment pipeline to improve helpfulness and reduce harmful outputs, training on human-annotated preference pairs at a fraction of the compute cost of full PPO-based RLHF.
