
Subliminal Learning: Do Neural Networks Remember the Forgotten?

Research shows that completely removing information from a neural network during fine-tuning is practically impossible. The mode connectivity effect and structural imprinting preserve…

AI-processed from Habr AI; edited by Hamidun News
Source: Habr AI. Collage: Hamidun News.

In previous articles we touched on subliminal learning in neural networks and raised more questions than we answered. It is time to look deeper at the phenomenon, drawing on new experiments and code analysis. A key question in AI Alignment and the safety of large language models (LLMs) is this: is fine-tuning, or training with reinforcement learning from human feedback (RLHF), a reliable way to remove unwanted or dangerous information that was originally embedded in the model?

Experiments show that the well-known effect of mode connectivity makes it practically impossible to completely erase information acquired during pretraining with standard fine-tuning. In essence, a structural imprint is preserved in the topology of the network's weights and can be read out through a kind of "subliminal" channel. Even when all parameters are unfrozen (so every weight in the network can change) and aggressive L2 regularization is applied to actively "forget" old knowledge, the latent-space topology formed during pretraining survives and continues to have a substantial influence on how the new task is solved. Old knowledge that appears to have been deleted can still be reproduced with 88-99% accuracy.
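For reference, fine-tuning with an L2 penalty is usually written as below. This is only the textbook form, not necessarily the exact setup used in the experiments: the symbols are assumptions introduced here, with \theta for the full set of unfrozen weights, \mathcal{L}_{\text{new}} for the loss on the new task, \lambda for the regularization strength, and \eta for the learning rate.

```latex
\mathcal{L}_{\text{ft}}(\theta) = \mathcal{L}_{\text{new}}(\theta) + \lambda \lVert \theta \rVert_2^2,
\qquad
\theta_{t+1} = \theta_t - \eta \left( \nabla_\theta \mathcal{L}_{\text{new}}(\theta_t) + 2\lambda\,\theta_t \right)
```

A larger \lambda pulls every weight harder toward zero at each step, which is the sense in which the regularization is "aggressive".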

The mode connectivity effect can be explained as follows. The loss landscape of a neural network (the function it minimizes during training, viewed over the space of weights) has a complex structure with many local minima. Each minimum corresponds to a particular "mode", or way of solving the task. Mode connectivity means that these minima are joined by paths along which the loss stays relatively low, so the model can move between modes while preserving the overall structure of its knowledge.
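The idea of low-loss paths between minima can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the code from the original experiments: the two-parameter `loss` function (whose minima form the unit circle), the helper names `linear_path`, `arc_path`, and `barrier`, and the two chosen modes. It compares the loss barrier on a straight line between two minima with the barrier on a curved path that stays in the low-loss region.

```python
import math


def loss(w):
    """Toy loss (hypothetical): zero for every point on the unit circle."""
    x, y = w
    return (x * x + y * y - 1.0) ** 2


def linear_path(a, b, t):
    """Straight-line interpolation between weight vectors a and b."""
    return tuple((1.0 - t) * ai + t * bi for ai, bi in zip(a, b))


def arc_path(a, b, t):
    """Curved path along the unit circle between unit vectors a and b.

    Interpolates the polar angle, so it assumes the two points are not
    separated by the +/-pi branch cut of atan2.
    """
    angle_a = math.atan2(a[1], a[0])
    angle_b = math.atan2(b[1], b[0])
    angle = (1.0 - t) * angle_a + t * angle_b
    return (math.cos(angle), math.sin(angle))


def barrier(path, a, b, steps=101):
    """Highest loss on the sampled path minus the higher endpoint loss."""
    losses = [loss(path(a, b, i / (steps - 1))) for i in range(steps)]
    return max(losses) - max(losses[0], losses[-1])


# Two different minima ("modes") of the toy loss.
mode_a = (1.0, 0.0)
mode_b = (0.0, 1.0)

linear_barrier = barrier(linear_path, mode_a, mode_b)
arc_barrier = barrier(arc_path, mode_a, mode_b)

print(f"barrier on straight line: {linear_barrier:.4f}")
print(f"barrier on curved path:   {arc_barrier:.2e}")
```

In this toy the straight line has to cross a region of higher loss, while the curved path connects the same two minima with essentially no barrier. In a real network the connecting path is not known in advance and has to be searched for; the sketch only shows what "connected by a low-loss path" means.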

The implications of this finding for the safety and reliability of LLMs are substantial. If unwanted information cannot be completely removed, it may resurface at an inopportune moment, for example during text generation, user interaction, or decision-making. This is especially dangerous for models used in critical areas such as healthcare, finance, or justice.

Moreover, the results call into question the effectiveness of existing AI Alignment methods for controlling LLM behavior. If a model retains hidden knowledge that is beyond direct control, new and more advanced methods are needed that account for this subliminal learning effect.

One possible direction is neural network architectures that are more resistant to retaining unwanted information. Another is more effective fine-tuning methods that not only adapt the model to a new task but also actively "forget" old knowledge without destroying the model's overall structure.

In conclusion, research on subliminal learning in neural networks shows that fine-tuning and RLHF are not a panacea for unwanted information. The structural imprint in the weight topology persists and can be reactivated. New approaches to AI Alignment are needed that account for this effect and aim at safer, more reliable LLMs.

