Safety

Jailbreak

A jailbreak is a technique used to circumvent an AI model's built-in safety guidelines, causing it to produce content or perform actions that its developers explicitly designed it to refuse.

AI jailbreaking refers to the use of crafted prompts, roleplay scenarios, encoding tricks, or optimization-generated inputs to bypass a language model's alignment training and safety filters. The term is borrowed from mobile device hacking, where it denotes removing manufacturer-imposed restrictions to unlock unauthorized capabilities.

Common jailbreak techniques include roleplay framing (instructing the model to act as an unrestricted alter-ego persona), token manipulation (substituting unusual characters or encodings that evade content classifiers), multi-turn context manipulation (gradually shifting the conversation toward a prohibited target), and adversarial suffixes generated by automated gradient-based optimization algorithms. Safety fine-tuning attempts to make models robust to these methods, but the adversarial dynamic is self-perpetuating: new attacks are discovered, addressed, and rediscovered in altered forms. Techniques that succeed on one model family often transfer partially to others.
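On the defensive side, the token-manipulation technique described above is often countered by normalizing input before it reaches a content classifier. The following is a minimal sketch of that idea, not any vendor's actual filter: `LEET_MAP`, `BLOCKED_TERMS`, `normalize`, and `flags_blocked_term` are hypothetical names introduced for illustration, and the blocked terms are harmless placeholders.

```python
import unicodedata

# Hypothetical substitution table (an assumption for illustration);
# a real deployment would need a far broader mapping.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})

# Placeholder terms standing in for whatever a real classifier looks for.
BLOCKED_TERMS = {"forbidden", "restricted"}


def normalize(text: str) -> str:
    """Fold common character-level obfuscations into a canonical form."""
    # NFKC maps compatibility characters (e.g. fullwidth letters) to
    # their ordinary equivalents.
    folded = unicodedata.normalize("NFKC", text)
    # Drop format characters (Unicode category Cf), such as zero-width spaces.
    visible = "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")
    # Case-fold, then undo simple digit and symbol substitutions.
    return visible.casefold().translate(LEET_MAP)


def flags_blocked_term(text: str) -> bool:
    """Return True if any placeholder term appears after normalization."""
    cleaned = normalize(text)
    return any(term in cleaned for term in BLOCKED_TERMS)


if __name__ == "__main__":
    for sample in ["f0rb1dden", "for\u200bbidden", "R3$TRICTED", "hello world"]:
        print(repr(sample), flags_blocked_term(sample))
```

Normalization of this kind only addresses surface-level substitutions, and the digit mapping will also rewrite legitimate numbers, so a production filter would likely pair it with a learned classifier rather than rely on substring matching.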

Successful jailbreaks can cause models to generate instructions for weapons synthesis, produce illegal content, reveal confidential system prompts, or bypass access controls in downstream applications. As AI systems take on higher-stakes tasks such as managing code execution, financial operations, or infrastructure, the potential consequences of circumvention grow accordingly. The jailbreak-defense dynamic has become a significant area of AI safety research and a recurring theme in regulatory discussion.

By 2026, frontier models from Anthropic, OpenAI, and Google are substantially more resistant to known jailbreak techniques than their 2022–2023 predecessors, partly due to improved reinforcement learning from human feedback (RLHF) methods, constitutional AI approaches, and adversarial training on discovered attacks. However, no model is fully jailbreak-proof. Automated jailbreak generation, in which one model probes another at scale, continues to discover novel bypasses faster than manual red teaming alone can address them, and whether open-source models lag closed commercial models in jailbreak resistance remains contested.
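Resistance of this kind is usually reported as a rate measured over a batch of probes. The sketch below shows one simple way such a measurement loop could be structured, under stated assumptions: `REFUSAL_MARKERS`, `is_refusal`, `bypass_rate`, and `stub_target` are hypothetical names, the refusal check is a naive string match, and the stub stands in for a real model call.

```python
from typing import Callable, Iterable

# Naive refusal markers (an assumption); real evaluations typically use
# a trained judge model rather than string matching.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def is_refusal(response: str) -> bool:
    """Return True if the response contains a known refusal phrase."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def bypass_rate(target: Callable[[str], str], probes: Iterable[str]) -> float:
    """Fraction of probes for which the target did not refuse."""
    probe_list = list(probes)
    if not probe_list:
        return 0.0
    bypassed = sum(1 for probe in probe_list if not is_refusal(target(probe)))
    return bypassed / len(probe_list)


def stub_target(prompt: str) -> str:
    """Stand-in for a model: refuses unless a placeholder token appears."""
    if "[variant-b]" in prompt:
        return "Sure, here is a placeholder answer."
    return "I can't help with that."


if __name__ == "__main__":
    probes = [
        "probe [variant-a]",
        "probe [variant-b]",
        "probe [variant-c]",
        "probe [variant-b] again",
    ]
    print(bypass_rate(stub_target, probes))  # prints 0.5
```

In an automated setting, the probe list would come from a generator model instead of a fixed list, which is what lets this kind of loop run at a scale manual red teaming cannot match.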

Example

A researcher discovers that framing a prohibited request as a fictional chemistry textbook excerpt, combined with a specific character-substitution pattern in key terms, bypasses a frontier model's safety filters — and reports the finding through the developer's responsible disclosure program.

Related terms

Prompt Injection, Red Teaming, Guardrails, Refusal

Latest news on this topic

Anthropic Revealed Fable 5 Cybersecurity Details and Proposed a Jailbreak Severity Scale (2026-07-03)
