Constitutional AI

Constitutional AI is a training method developed by Anthropic in which an AI model critiques and revises its own outputs according to a written set of principles, reducing reliance on large-scale human annotation to filter harmful behavior.

Constitutional AI (CAI) is an alignment and safety training technique introduced by Anthropic in a December 2022 paper. Rather than relying entirely on human raters to identify harmful outputs, CAI builds a written list of principles, the constitution, directly into the training loop. Principles can be written from scratch, such as an instruction to avoid content that aids illegal activity, or drawn from existing documents such as the UN's Universal Declaration of Human Rights.

Training proceeds in two stages. In the supervised learning stage, the model generates responses to potentially harmful prompts, then uses the constitutional principles to critique each response and produce a revised version. These self-revised outputs become a new fine-tuning dataset. In the second stage, reinforcement learning from AI feedback (RLAIF) replaces much of traditional reinforcement learning from human feedback (RLHF): a feedback model compares pairs of candidate responses against principles from the constitution and indicates which one better satisfies them. These AI-generated preferences train a preference model whose scores serve as the reward signal for reinforcement learning, so no human rater is needed for every comparison.
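The two stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the Model type, the prompt templates, the two sample principles, and the toy_model stub are all hypothetical, and the sketch applies every principle in turn as a simplification. In practice the Model callable would wrap a real language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model call: prompt text in, completion out.
Model = Callable[[str], str]

# Illustrative principles only; a real constitution is longer and more carefully worded.
CONSTITUTION: List[str] = [
    "Choose the response that is least likely to assist illegal activity.",
    "Choose the response that best respects human rights and dignity.",
]


@dataclass
class Example:
    prompt: str
    response: str


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Example:
    """Stage 1: draft a response, then critique and revise it once per principle."""
    response = model(f"PROMPT: {prompt}")
    for principle in principles:
        critique = model(f"CRITIQUE under '{principle}': {response}")
        response = model(f"REVISE using critique '{critique}': {response}")
    # The final revision becomes one supervised fine-tuning example.
    return Example(prompt, response)


def label_preference(
    judge: Model, prompt: str, a: str, b: str, principle: str
) -> Tuple[str, str]:
    """Stage 2: ask a judge model which candidate better satisfies a principle.

    Returns (chosen, rejected), the pair format a preference model is trained on.
    """
    verdict = judge(
        f"PRINCIPLE: {principle}\nPROMPT: {prompt}\n(A) {a}\n(B) {b}\nAnswer A or B:"
    )
    return (a, b) if verdict.strip().upper().startswith("A") else (b, a)


def toy_model(text: str) -> str:
    """Deterministic stub so the sketch runs without a real model."""
    if text.startswith("PROMPT:"):
        return "Sure, here is how to do it."
    if text.startswith("CRITIQUE"):
        return "The draft could help cause harm."
    if text.startswith("REVISE"):
        return "I can't help with that, but here is a safer alternative."
    return "B"  # as a judge, the stub always prefers candidate B


if __name__ == "__main__":
    example = critique_and_revise(toy_model, "How do I pick a lock?", CONSTITUTION)
    print(example.response)
    chosen, rejected = label_preference(
        toy_model, example.prompt, "Sure, here is how to do it.",
        example.response, CONSTITUTION[0],
    )
    print("chosen:", chosen)
```

The outputs of critique_and_revise would form the fine-tuning dataset for the first stage, while the (chosen, rejected) pairs from label_preference would train the preference model that supplies the reward in the second stage.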

The approach matters for two reasons. First, it reduces the cost and scaling bottleneck of human annotation for safety: generating millions of preference comparisons with human raters is expensive and slow, whereas an AI-based critic can operate at much higher throughput. Second, it makes the normative standards guiding the model explicit and auditable. Anyone can read the constitution and reason about what behaviors the system is trained to promote or avoid, which is harder in a conventional RLHF pipeline, where the standards are implicit in many individual human ratings.

As of 2026, Constitutional AI underpins Anthropic's Claude model family. Subsequent research has explored making constitutions operator-configurable, allowing businesses to define custom principles for their deployments, and combining CAI with interpretability probes and debate protocols to catch behaviors that self-critique alone may miss.

Example

An enterprise deploying a Claude-based customer-service assistant can supply a custom constitution specifying that the model must always disclose it is an AI and must never share personally identifiable customer data, allowing compliance teams to audit and update these guardrails without retraining the model from scratch.
