Training

Reinforcement Learning

Reinforcement learning is a machine learning paradigm in which an agent learns a decision-making policy by interacting with an environment and receiving scalar reward signals, optimizing for maximum cumulative reward without requiring a pre-labeled dataset of correct actions.

Reinforcement learning (RL) frames learning as sequential decision-making: an autonomous agent interacts with an environment, and at each step it observes the current state, takes an action, receives a scalar reward signal, and transitions to a new state. The objective is to learn a policy, a mapping from states to actions (or to probability distributions over actions), that maximizes the expected cumulative discounted reward, in which a discount factor between 0 and 1 weights near-term rewards more heavily than distant ones.
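The observe, act, receive-reward cycle and the discounted return can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: `ChainEnv`, `run_episode`, and `discounted_return` are hypothetical names invented for this example, and the five-position chain environment is an assumption chosen only because it is small enough to check by hand.

```python
import random


class ChainEnv:
    """Hypothetical toy environment: positions 0..4, with the goal at position 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the list of rewards the agent received."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)                  # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        rewards.append(reward)
        if done:
            break
    return rewards


def discounted_return(rewards, gamma=0.99):
    """Compute sum over t of gamma**t * rewards[t], accumulating from the last step."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    random.seed(0)
    rewards = run_episode(ChainEnv(), lambda s: random.choice([0, 1]))
    print(len(rewards), discounted_return(rewards))
```

A policy that always moves right reaches the goal in four steps, so its reward sequence is `[0.0, 0.0, 0.0, 1.0]` and its return with a discount factor of 0.9 is 0.9 cubed, or 0.729. A policy that wanders before reaching the goal earns the same final reward but a smaller return, which is how discounting favors shorter paths.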

The theoretical foundation of modern RL is the Markov Decision Process (MDP) formalism, which originated in Richard Bellman's work on dynamic programming in the 1950s and is presented systematically for RL in Sutton and Barto's "Reinforcement Learning: An Introduction" (1998, 2nd ed. 2018). Key algorithmic families include value-based methods (Q-learning, DQN), which learn an action-value function; policy-gradient methods (REINFORCE, TRPO, PPO), which directly optimize the policy using gradient estimates; and actor-critic methods, which pair a learned policy (the actor) with a learned value function (the critic). Deep RL, which pairs neural networks with these algorithms, enabled landmark results: DeepMind's DQN (2015) was evaluated on 49 Atari games and reached performance comparable to a professional human tester on more than half of them, and AlphaGo defeated world Go champion Lee Sedol (2016) using a combination of supervised learning, RL, and Monte Carlo Tree Search.
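As an illustration of the value-based family, the sketch below runs tabular Q-learning with epsilon-greedy exploration on a tiny chain MDP. The environment, the function names (`step`, `greedy_action`, `q_learning`), and the hyperparameter values are all assumptions made for this example. DQN replaces the table with a neural network and adds machinery such as experience replay, which this sketch does not attempt to show.

```python
import random

N_STATES = 5       # positions 0..4; position 4 is the terminal goal
ACTIONS = (0, 1)   # 0 = left, 1 = right


def step(state, action):
    """Hypothetical chain dynamics: reward 1.0 only on reaching the goal."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy_action(q, state, rng):
    """Pick an action with the highest Q-value, breaking ties at random."""
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table q[state][action] of estimated action values."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap so early, mostly random episodes still end
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)            # explore
            else:
                action = greedy_action(q, state, rng)   # exploit
            next_state, reward, done = step(state, action)
            # Q-learning target: bootstrap from the best next action,
            # or use the reward alone if the episode has terminated.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


if __name__ == "__main__":
    for s, (left, right) in enumerate(q_learning()):
        print(s, round(left, 3), round(right, 3))
```

After training, the greedy action in every non-terminal state should be "right", since that is the shortest path to the only reward. The learned values for moving right approach 1.0 next to the goal and shrink by a factor of `gamma` for each additional step away from it.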

RL differs from supervised learning in that it requires no pre-labeled dataset of correct actions, and from unsupervised learning in that it optimizes an explicit reward signal rather than uncovering structure in unlabeled data. Learning emerges from interaction and often delayed rewards, which makes RL a natural framework for sequential decision-making problems where the optimal action sequence cannot be specified in advance. Examples include robotics control, game playing, autonomous driving, and aligning large language models with human intent through reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR).

As of 2026, RL plays a central role across AI subfields. The developers of OpenAI's o3, Google DeepMind's Gemini 2.5, and Anthropic's Claude 3.7 Sonnet all describe RL-based post-training as a key component of those models' reasoning capabilities. In robotics, RL combined with sim-to-real transfer drives manipulation and locomotion in systems from companies including Figure AI and Boston Dynamics. Key open research challenges include sample efficiency, reward specification, and robust generalization to environments not seen during training.

Example

OpenAI used Proximal Policy Optimization (PPO) in the RLHF post-training stage of InstructGPT, and reports RLHF post-training for GPT-4 as well. In this setup the model is fine-tuned against a learned reward model that scores sampled completions, shifting its output distribution toward responses that human raters prefer.
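The core of PPO is a clipped surrogate objective that limits how far one update can move the policy away from the policy that generated the samples. The sketch below computes that objective for a small batch in plain Python. The function name `ppo_clip_objective` and the example numbers are assumptions for illustration only. In an RLHF pipeline the advantages would be derived from reward-model scores, and published setups typically add further terms, such as a KL penalty against a reference model, that are omitted here.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized) over a batch.

    new_logps:  log-probabilities of the sampled actions under the current policy
    old_logps:  log-probabilities of the same actions under the sampling policy
    advantages: advantage estimates for those actions
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum removes any incentive to push the ratio
        # outside the interval [1 - clip_eps, 1 + clip_eps].
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    # One sample whose probability rose by 50% with a positive advantage,
    # and one whose probability halved with a negative advantage.
    new_logps = [math.log(1.5), math.log(0.5)]
    old_logps = [0.0, 0.0]
    print(ppo_clip_objective(new_logps, old_logps, [1.0, -1.0]))
```

With `clip_eps=0.2`, a sample whose probability ratio has grown to 1.5 and whose advantage is 1.0 contributes only 1.2 to the objective, so raising that probability further earns no additional credit. This is the mechanism that keeps each policy update conservative.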
