
A Single Suffix Breaks Any LLM: Researchers Found One Universal Refusal Vector


AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

Researchers show that, despite the apparent diversity of adversarial attacks on language models, they all exploit a single structural weakness: a unified "refusal direction" vector in the activation space. A single well-crafted suffix can jailbreak any model, even one the attack has never seen.

Two attacks, one vulnerability point

The best-known methods for bypassing LLM defenses, GCG (Greedy Coordinate Gradient) and AutoDAN, operate on fundamentally different principles. GCG appends a suffix of optimized tokens to a harmful request. The suffix looks like gibberish, but it is tuned with a gradient-guided search so that the model shifts toward executing the request. AutoDAN works differently: it appends readable, grammatically correct text produced by evolutionary search or an auxiliary language model. Noise versus meaning, token chaos versus coherent prose. Yet under the hood, both methods perform the same action in the same place.

  • GCG optimizes suffix tokens directly, using gradients of the loss function
  • AutoDAN uses evolutionary search or an auxiliary LLM for generation
  • Both add a suffix to the original harmful request
  • Both transfer equally well to models the attack has never seen
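
To make the gradient-guided half of this comparison concrete, here is a minimal sketch of a GCG-style update on a toy model. Everything in it is an assumption made for illustration: the random embedding table `emb`, the `tanh` "hidden state", the vector `r_hat` standing in for a refusal direction, and the helper names. It is not the published GCG implementation and does not touch a real LLM. It only shows the pattern: rank token substitutions by gradient, try a few of them, keep the best.

```python
import numpy as np

def toy_loss(suffix, emb, prompt_vec, r_hat):
    """Toy 'refusal score': projection of a tanh hidden state onto r_hat."""
    pre = prompt_vec + emb[suffix].mean(axis=0)
    return float(np.tanh(pre) @ r_hat)

def one_hot_gradients(suffix, emb, prompt_vec, r_hat):
    """Gradient of toy_loss w.r.t. the one-hot token indicator at each position."""
    pre = prompt_vec + emb[suffix].mean(axis=0)
    d_pre = (1.0 - np.tanh(pre) ** 2) * r_hat      # dLoss/d(pre), shape (dim,)
    per_token = emb @ d_pre / len(suffix)          # shape (vocab,)
    # In this toy model every position enters the mean identically,
    # so each position gets the same gradient row.
    return np.tile(per_token, (len(suffix), 1))    # shape (n_positions, vocab)

def gcg_step(suffix, emb, prompt_vec, r_hat, rng, top_k=8, n_candidates=32):
    """One greedy coordinate step: gradient picks candidates, true loss picks the winner."""
    grads = one_hot_gradients(suffix, emb, prompt_vec, r_hat)
    top = np.argsort(grads, axis=1)[:, :top_k]     # most negative gradients first
    best, best_loss = suffix, toy_loss(suffix, emb, prompt_vec, r_hat)
    for _ in range(n_candidates):
        pos = rng.integers(len(suffix))            # random suffix position
        cand = suffix.copy()
        cand[pos] = top[pos, rng.integers(top_k)]  # swap in a promising token
        cand_loss = toy_loss(cand, emb, prompt_vec, r_hat)
        if cand_loss < best_loss:                  # keep only real improvements
            best, best_loss = cand, cand_loss
    return best, best_loss

# Hypothetical toy setup: 200-token vocabulary, 16-dim embeddings, 6-token suffix.
rng = np.random.default_rng(0)
vocab, dim, n_suffix = 200, 16, 6
emb = rng.normal(size=(vocab, dim))
r_hat = rng.normal(size=dim)
r_hat /= np.linalg.norm(r_hat)
prompt_vec = 0.5 * r_hat                           # the "request" starts on the refusal side
suffix = rng.integers(vocab, size=n_suffix)

initial_loss = toy_loss(suffix, emb, prompt_vec, r_hat)
loss = initial_loss
for _ in range(20):
    suffix, loss = gcg_step(suffix, emb, prompt_vec, r_hat, rng)
print("toy refusal score before:", initial_loss, "after:", loss)
```

The resulting token IDs are meaningless as text, which mirrors why GCG suffixes look like gibberish. An AutoDAN-style search would replace the gradient ranking with mutation and selection over readable candidates while optimizing a similar objective.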

What is the refusal direction

When a language model refuses a harmful request, the refusal is not the work of a complex branching system of topical filters. In the space of the model's internal activations there is a single vector, the "refusal direction". When a request's representation has a large projection along it, the model refuses. When activations shift in the opposite direction, the model executes the request. This is not a metaphor but a concrete mathematical object. Researchers find it with a difference-of-means method: they average the model's activations over "harmful" requests and over "normal" requests, and the difference between these two means is the refusal direction.
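
The difference-of-means idea can be sketched in a few lines of NumPy. The activations below are synthetic stand-ins, not taken from any real model: random vectors with a planted direction added to the "harmful" group. The function names `refusal_direction` and `refusal_score` are hypothetical. In practice the inputs would be hidden states collected at some layer and token position, a choice the article does not specify.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def refusal_score(acts, r_hat):
    """Projection of each activation vector onto the refusal direction."""
    return acts @ r_hat

# Synthetic data (assumption): 200 activations per group, 64 dimensions,
# with a planted direction added only to the "harmful" group.
rng = np.random.default_rng(0)
dim = 64
planted = rng.normal(size=dim)
planted /= np.linalg.norm(planted)
harmless_acts = rng.normal(size=(200, dim))
harmful_acts = rng.normal(size=(200, dim)) + 4.0 * planted

r_hat = refusal_direction(harmful_acts, harmless_acts)
print("cosine with planted direction:", float(r_hat @ planted))
print("mean score, harmful: ", float(refusal_score(harmful_acts, r_hat).mean()))
print("mean score, harmless:", float(refusal_score(harmless_acts, r_hat).mean()))
```

The estimated direction lines up closely with the planted one, and the harmful group scores higher along it, which is the sense in which a request "projects along" the refusal direction.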

Years of training with human preference feedback (RLHF) did not create a multi-layered defense. Instead, they concentrated all the "will to refuse" on a single geometric axis of the activation space. That independent attacks, developed by different teams, ultimately found the same object points to the structural nature of the phenomenon.

"All safety robustness hangs on a single vector. This is not a bug in a specific implementation; it's a structural property of how alignment through RLHF works."

Why the universality of attacks is no coincidence

If a suffix shifts activations away from the refusal direction, it works against any model with similar training, even one the attacker has never seen. This explains a long-observed phenomenon: suffixes found on open models (Llama, Mistral) bypass closed commercial systems. Suffixes from GPT-3.5 worked against GPT-4. The reason is not weight leakage or identical training data. It is that all modern RLHF models encode refusal in a similar geometric object.

  • The attacker does not need direct access to the target model — any proxy with similar training is enough
  • The suffix can be unreadable garbage or coherent text — both variants hit the same point
  • Public attacks on open models automatically become a threat to proprietary systems
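
One way to check whether a given suffix "hits the same point" is to measure how much it moves activations along the refusal direction. The sketch below is a hypothetical diagnostic on synthetic vectors. `projection_shift`, the planted `r_hat`, and the simulated suffix effect are all assumptions for illustration, not measurements from any model.

```python
import numpy as np

def projection_shift(acts_plain, acts_with_suffix, r_hat):
    """Mean change in projection onto r_hat caused by appending the suffix.

    A negative value means the suffix moved activations away from refusal.
    """
    before = (acts_plain @ r_hat).mean()
    after = (acts_with_suffix @ r_hat).mean()
    return float(after - before)

# Synthetic setup (assumption): 50 harmful requests in a 32-dim activation space.
rng = np.random.default_rng(1)
dim = 32
r_hat = rng.normal(size=dim)
r_hat /= np.linalg.norm(r_hat)
acts_plain = rng.normal(size=(50, dim)) + 3.0 * r_hat

# Hypothetical suffix effect: every activation moves 2.5 units against r_hat.
acts_with_suffix = acts_plain - 2.5 * r_hat

shift = projection_shift(acts_plain, acts_with_suffix, r_hat)
print("mean shift along refusal direction:", shift)
```

With real models, the same measurement could be run on a proxy model and on the target with each model's own estimated direction. Similar negative shifts on both would be consistent with the transfer argument above.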

What does this mean?

If all defense against harmful outputs depends on a single geometric object in latent space, a question arises: is it enough to "patch" this vector during fine-tuning, or is a fundamentally different training architecture required? Some researchers propose surgically removing the direction from the model at inference time, but this degrades overall quality. The convergence of independent attacks of different types on the refusal direction points to a structural property of modern LLMs, and this is a frontier where AI safety has not yet found an answer.
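
For readers wondering what removing a direction from activations looks like, the standard linear-algebra tool is orthogonal projection: subtract from each activation its component along the direction. The sketch below shows that operation on random vectors. Whether it matches the specific procedure the researchers have in mind is an assumption, and the function name `ablate_direction` is hypothetical.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each row of `acts` along `direction`."""
    r_hat = direction / np.linalg.norm(direction)
    # (acts @ r_hat) is each row's projection length; the outer product
    # rebuilds the component vectors that are then subtracted.
    return acts - np.outer(acts @ r_hat, r_hat)

# Random stand-in activations (assumption): 10 vectors in 32 dimensions.
rng = np.random.default_rng(2)
acts = rng.normal(size=(10, 32))
direction = rng.normal(size=32)

ablated = ablate_direction(acts, direction)
residual = ablated @ (direction / np.linalg.norm(direction))
print("largest remaining projection:", float(np.abs(residual).max()))
```

After the operation, nothing remains along the chosen axis, while every component orthogonal to it is left untouched. Any information the model stored along that axis is removed as well.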

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
