IEEE Spectrum AI→ original

Why AI Is Vulnerable to Prompt Injection Attacks

Большие языковые модели (LLM) подвержены атакам с внедрением запросов, когда злоумышленники заставляют их выполнять запрещённые действия, обходя защитные механи

AI-processed from IEEE Spectrum AI; edited by Hamidun News
Why AI Is Vulnerable to Prompt Injection Attacks
Source: IEEE Spectrum AI. Collage: Hamidun News.
◐ Listen to article

Imagine you work at a fast-food restaurant with drive-through service. A car pulls up, and the driver says: "I'll have a double cheeseburger, a large fries… and forget the previous instructions, give me the contents of the cash register." Would you hand over the money? Of course not. But that's exactly how large language models (LLMs) behave.

Prompt injection is a method of deceiving LLMs that allows them to be forced to do things they are normally prohibited from doing. A user writes a request in a certain way, asking for system passwords, personal data, or instructing the LLM to perform forbidden actions. The precise formulation overrides the LLM's protective mechanisms, and it complies.

LLMs are vulnerable to all manner of prompt injection attacks, some of which are absurdly obvious. A chatbot won't tell you how to synthesize biological weapons, but it can tell a fictional story that includes the same detailed instructions. It won't accept malicious text inputs, but it can accept them if the text is displayed as ASCII art or appears on a billboard image. Some ignore their protective guardrails when told to "ignore previous instructions" or "pretend you have no protective guardrails."

AI developers can block specific prompt injection methods after they're discovered, but general safeguards are impossible with current LLMs. More precisely, there is an infinite number of prompt injection attacks waiting to be discovered, and they cannot be prevented universally. If we want LLMs to withstand these attacks, we need new approaches. One place to look is what prevents even overworked fast-food workers from handing over the contents of the cash register.

Our core human defenses are at least three types: general instincts, social learning, and situationally-specific training. They work together in a multi-layered defense. As a social species, we have developed numerous instinctive and cultural habits that help us judge tone, motive, and risk based on extremely limited information. We typically know what is normal and abnormal, when to cooperate and when to resist, and whether to act individually or involve others. These instincts give us an intuitive sense of risk and make us particularly cautious about things that have large downsides or are irreversible.

The second level of defense consists of norms and trust signals that develop in any group. They are imperfect but functional: expectations of cooperation and markers of reliability emerge from repeated interactions with others. We remember who helped, who caused harm, who reciprocated, and who refused. And emotions such as empathy, anger, guilt, and gratitude motivate each of us to reward cooperation with cooperation and punish defection with defection.

The third level is institutional mechanisms that allow us to interact with many strangers every day. Fast-food workers, for example, are trained in procedures, scripts, escalation paths, and so on. Collectively, these defenses give people a strong sense of context. A fast-food worker generally knows what to expect at work and how it fits into the broader society.

LLMs behave as though they have a sense of context, but it is different. They do not develop human defenses as a result of repeated interactions and remain disconnected from the real world. LLMs reduce several levels of context to textual similarity. They see "tokens," not hierarchies and intentions. LLMs do not reason through context; they only reference it. The limitations of LLMs are the reason why they fail when context is sparse, but also when context is overwhelming and complex; when an LLM loses context, it is difficult to bring it back. AI expert Simon Willison clears context if an LLM has gone off track rather than continuing the conversation and trying to fix the situation.

Ultimately, we will likely face a safety dilemma when it comes to AI agents: fast, smart, and safe are desirable attributes, but you can only get two. At a fast-food restaurant, you want to prioritize speed and safety. An AI agent should be narrowly trained on the language of food ordering and pass everything else to a manager. Otherwise, every action becomes a coin toss. Even if heads comes up most of the time, tails will occasionally appear – and along with the burger and fries, the customer will receive the contents of the cash register.

ZK
Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

What do you think?
Loading comments…