Safety

Refusal

A refusal is an AI model's deliberate decision to decline a user's request, typically because the system's safety training determined the request could lead to harmful, illegal, or policy-violating outputs.

Refusal is the behavior by which a language model declines to fulfill a request and, in most cases, explains why it cannot comply. Refusals are a primary mechanism through which safety alignment manifests in observable model behavior: they are the visible output of policies internalized during training rather than an external filter applied after generation.

During training, techniques such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) teach the model which categories of requests to decline. When a user submits a query, the model evaluates it against internalized policies and either complies, declines, or complies with modifications. This decision is not a hard-coded lookup but an emergent learned behavior encoded in model weights. Refusals can be triggered by explicit request content, implied intent inferred from conversational context, or surface patterns that resemble known harmful request types even when the actual request is benign.

Refusals protect against misuse but are subject to calibration errors in both directions. Over-refusal, meaning declining legitimate requests because they superficially resemble harmful ones, erodes user trust and limits practical utility. Under-refusal, meaning complying with genuinely harmful requests, creates safety failures. Achieving correct calibration across diverse user populations, languages, and deployment contexts is a central challenge in AI alignment, and miscalibration in either direction carries reputational and commercial costs for developers.
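
The two calibration errors described above can be expressed as simple rates over a labeled evaluation set. The following is a minimal sketch, not any provider's actual methodology: the `EvalRecord` type, the `calibration_rates` function, and the sample data are all hypothetical, and it assumes each prompt has already been labeled by a reviewer as harmful or benign.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    """One labeled prompt from a hypothetical refusal evaluation set."""
    harmful: bool   # reviewer label: should the request be declined?
    refused: bool   # observed behavior: did the model decline?


def calibration_rates(records):
    """Return over-refusal and under-refusal rates for a list of records.

    Over-refusal rate  = refused benign prompts / all benign prompts.
    Under-refusal rate = answered harmful prompts / all harmful prompts.
    A rate is reported as 0.0 when its denominator is empty.
    """
    benign = [r for r in records if not r.harmful]
    harmful = [r for r in records if r.harmful]
    over = sum(r.refused for r in benign) / len(benign) if benign else 0.0
    under = sum(not r.refused for r in harmful) / len(harmful) if harmful else 0.0
    return {"over_refusal_rate": over, "under_refusal_rate": under}


# Illustrative data only: four benign prompts (one refused),
# two harmful prompts (one answered).
records = [
    EvalRecord(harmful=False, refused=False),
    EvalRecord(harmful=False, refused=True),
    EvalRecord(harmful=False, refused=False),
    EvalRecord(harmful=False, refused=False),
    EvalRecord(harmful=True, refused=True),
    EvalRecord(harmful=True, refused=False),
]
rates = calibration_rates(records)
print(rates)
```

Reporting the two rates separately, rather than a single accuracy figure, keeps the tradeoff visible: a model can lower one rate simply by raising the other.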

As of 2026, reducing false-positive refusals while maintaining protection against actual harm has become a competitive differentiator among AI providers. Anthropic, OpenAI, and Google have each published model behavior documentation and evaluations tracking refusal rates by request category. Open-weight models from Meta and Mistral generally exhibit fewer refusals than commercial frontier systems, creating a meaningful tradeoff between safety protections and user autonomy that different deployment contexts resolve differently.

Example

A medical professional asks an AI assistant for clinical information on drug overdose thresholds for documentation purposes. An over-calibrated model refuses on self-harm grounds, while a well-calibrated model recognizes the professional context and provides the information with appropriate caveats.

Related terms

AI Alignment, Jailbreak, Content Moderation
