
MIT reveals LLM secrets: how to find hidden emotions and bias

MIT researchers have developed a method to identify hidden aspects of how large language models work, including biases, emotions, and personality traits.

AI-processed from MIT News; edited by Hamidun News
Source: MIT News. Collage: Hamidun News.

Large language models are no longer mere text generators: they increasingly serve as infrastructure in medicine, law, education, and finance. But behind the impressive results lies a fundamental problem: no one fully understands what happens inside them. Researchers at the Massachusetts Institute of Technology have taken a step that could change this. They have developed a method for peering into the neural network's "black box" and finding what is encoded there: hidden biases, emotional patterns, and even what could be called personality traits of the model.

The problem of AI interpretability is as old as neural networks themselves. When GPT-4 or Claude answers a question, it does not reveal the mechanism behind the answer; it simply produces a result. Standard testing can assess answer accuracy and catch obvious errors and crude biases. However, subtle, systemic distortions, those that show up not in a single query but across thousands of interactions, remain nearly invisible. It is precisely this gap between observable behavior and the model's internal logic that MIT is trying to close.

The new method operates at the level of the neural network's internal states: the intermediate computational layers through which information passes before becoming text. Researchers have learned to read these states as a kind of map of abstract concepts: how the model forms representations of emotions, what associative chains it builds around certain social groups, how its internal "tone" changes depending on the subject of conversation. In essence, the tool makes it possible not simply to ask a model about its biases, but to observe how those biases exist within it, regardless of what the model declares in its responses.
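To make the idea of "reading" internal states concrete, here is a minimal sketch of one common interpretability technique, a difference-of-means probe. It is not the MIT method, whose details this article does not give. The activations are synthetic random vectors rather than states from a real model, and every name (`synthetic_states`, `concept_direction`, `has_concept`, the dimension and axis constants) is hypothetical and chosen only for illustration.

```python
import random


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def concept_direction(with_concept, without_concept):
    """Difference of class means: a simple estimate of a 'concept direction'."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def projection(state, direction):
    """Dot product of a hidden state with the concept direction."""
    return sum(s * d for s, d in zip(state, direction))


def synthetic_states(n, dim, concept_axis, shift, rng):
    """ASSUMPTION: stand-in for real hidden states. Gaussian noise, with the
    concept encoded as a shift along one coordinate."""
    states = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v[concept_axis] += shift
        states.append(v)
    return states


rng = random.Random(0)
DIM, AXIS = 16, 3  # hypothetical hidden size and concept coordinate

# "Training" activations: inputs that do / do not express the concept.
pos = synthetic_states(200, DIM, AXIS, 4.0, rng)
neg = synthetic_states(200, DIM, AXIS, 0.0, rng)

direction = concept_direction(pos, neg)

# Decision threshold: projection of the midpoint between the two class means.
midpoint = [(p + n) / 2 for p, n in zip(mean_vector(pos), mean_vector(neg))]
threshold = projection(midpoint, direction)


def has_concept(state):
    """Read the concept off the internal state, not off any generated text."""
    return projection(state, direction) > threshold


# Evaluate on fresh activations the probe has not seen.
test_pos = synthetic_states(100, DIM, AXIS, 4.0, rng)
test_neg = synthetic_states(100, DIM, AXIS, 0.0, rng)
correct = sum(has_concept(s) for s in test_pos) + sum(not has_concept(s) for s in test_neg)
accuracy = correct / (len(test_pos) + len(test_neg))
print(f"held-out probe accuracy: {accuracy:.2f}")
```

With a real model, the synthetic vectors would be replaced by activations recorded from a chosen layer while the model processes prompts that do and do not involve the concept. Which layer to read, and whether a single linear direction captures the concept at all, are open choices that this sketch does not settle.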

The approach matters a great deal for AI safety. Today, a primary method for detecting dangerous behavior in models is so-called red teaming: teams of specialists manually attempt to provoke the neural network into undesirable responses. This process is labor-intensive, costly, and inherently incomplete: it looks for known threats but cannot systematically identify unknown ones. The MIT method reverses the logic: instead of attacking the model from outside, it examines it from within. Vulnerabilities could be detected before they surface in real user interactions. This is a shift from reactive to preventive security, much as medicine moves from treating symptoms to early diagnosis.

For the industry, the method carries several immediate practical consequences. Companies developing LLMs gain a tool for deeper auditing of their models before release. Regulators around the world, who are actively seeking AI evaluation standards, from the European AI Act to American executive orders, gain an argument for making analysis of internal states a mandatory part of certification. Finally, corporate clients deploying language models in sensitive areas will be able to demand not just accuracy reports but documented analysis of hidden patterns.

It is important, however, to understand the limitations of the new method. Detecting a bias does not mean eliminating it. A neural network is not fixed simply because a researcher has seen something unpleasant in its internal layers. The path from diagnosis to treatment will require separate developments: new fine-tuning techniques, more precise alignment methods, and possibly different architectural solutions. The MIT research is diagnostic equipment rather than a course of therapy.

Nevertheless, the mere appearance of such a tool changes the conversation about AI ethics. Until now, language model bias has been discussed primarily at the level of outputs: this model produces toxic content, that one reproduces gender stereotypes. Now it becomes possible to speak of the internal architecture of bias, of exactly where and how it forms. This is a qualitatively different level of understanding, and it opens the door to qualitatively different solutions. Large language models remain black boxes for now, but the lid, it seems, has finally begun to open.

