Training

Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is a training pipeline that collects human preference judgments between model outputs, trains a reward model on those judgments, and uses reinforcement learning to fine-tune the language model toward behavior humans rate more highly.

As a multi-stage alignment pipeline, RLHF is designed to improve a language model's outputs along dimensions such as helpfulness, accuracy, and harmlessness. These qualities are difficult to specify precisely as supervised learning targets, but humans can judge them fairly consistently when comparing pairs of responses.

RLHF typically proceeds in three stages. First, the base model is fine-tuned on high-quality demonstrations via supervised learning (instruction tuning). Second, human annotators are shown pairs of model outputs for the same prompt and indicate which is preferable; these comparisons are used to train a separate reward model that assigns each response a scalar score predicting human preference. Third, the language model's parameters are updated using a reinforcement learning algorithm, most commonly Proximal Policy Optimization (PPO), to maximize the reward model's scores. A KL-divergence penalty keeps the updated policy close to the supervised baseline, which limits reward hacking and excessive drift.
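The second and third stages can be made concrete with a small sketch. It assumes a Bradley-Terry style pairwise loss for the reward model and a per-sequence KL estimate subtracted from the reward, which is a common formulation but not necessarily the exact one used in any particular system. The function names and the default beta=0.1 are hypothetical, chosen for illustration only.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Reward-model loss for one human comparison (Bradley-Terry style).

    The loss shrinks as the reward model scores the human-preferred
    response further above the rejected one.
    """
    return -log_sigmoid(score_preferred - score_rejected)


def kl_penalized_reward(reward_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward-model score minus beta times a KL estimate.

    Summing log pi(token) - log pi_ref(token) over the sampled tokens
    gives a single-sample estimate of KL(policy || reference).
    beta is an assumed illustrative value, not a recommended setting.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_score - beta * kl_estimate


if __name__ == "__main__":
    # Reward model already ranks the preferred response higher: low loss.
    print(pairwise_preference_loss(2.0, 0.5))
    # Policy has drifted from the reference on both tokens: reward is reduced.
    print(kl_penalized_reward(1.2, [-0.5, -1.0], [-0.7, -1.4], beta=0.1))
```

In this sketch, the RL step would maximize the value returned by kl_penalized_reward, so a response only earns a high training signal if it scores well with the reward model without moving the policy far from the supervised baseline.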

RLHF was the core technique behind InstructGPT (OpenAI, early 2022), which showed that preference-based training made a model more useful in practice and reduced harmful outputs relative to supervised instruction tuning alone. ChatGPT, launched in November 2022 and built on the same pipeline, drew widespread public attention to the approach. Anthropic's Claude and Google's Gemini models also apply preference-based alignment in their training pipelines, and RLHF became the de facto standard for aligning commercial language models through 2023–2024.

By 2026, PPO-based RLHF has been supplemented or replaced by simpler alternatives in many production pipelines. Direct Preference Optimization (DPO), introduced in 2023, reformulates preference alignment as a supervised loss directly on the language model, eliminating the separate reward model and RL training loop. Variants including IPO, KTO, and ORPO offer additional trade-offs in stability and data efficiency. Reward models trained from human comparisons continue to be used in evaluation, data filtering, and as judges in model-based assessment frameworks.
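To illustrate how DPO turns preference alignment into a supervised loss, the following is a minimal sketch of the per-pair DPO objective. It assumes the summed log-probabilities of each response under the trained policy and under a frozen reference model have already been computed. The function and parameter names, and the default beta=0.1, are hypothetical.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) preference pair.

    Each argument is the total log-probability of a response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is zero.
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

No reward model is queried and no responses are sampled during training here; the loss depends only on log-probabilities of fixed preference pairs, which is why DPO can run as an ordinary supervised training loop.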

Example

OpenAI applied RLHF to GPT-3 to produce InstructGPT: after supervised fine-tuning on demonstrations, human labelers compared model outputs and labeled their preferences, a reward model was trained on those labels, and PPO updated the policy. Labelers preferred the resulting model's outputs to those of the original GPT-3, even when the InstructGPT model was far smaller (1.3B versus 175B parameters).
