Anatomy of Claude: Anthropic Finally Looked Inside the Neural Network
AI-processed from Habr AI; edited by Hamidun News
Imagine you have spent years talking with a brilliant professor who gives outstanding answers, yet you have no idea how their thoughts are organized. You ask a question and get a result, but the process inside remains a mystery. That is how we have lived with large language models for the past few years. We called them "black boxes" and wrote off their oddities as the magic of neural network weights. The Anthropic team decided it was time to turn on the light in this dark room. Its researchers carried out a large-scale dissection of Claude 3 Sonnet, and the results challenge much of what we assumed about how these models work internally.
For a long time, it was believed that knowledge inside a neural network was smeared thinly across billions of parameters. You could not point to a specific place and say: "Here Claude thinks about London, and here about quantum physics." Anthropic used a method called "dictionary learning." Put simply, they trained a second neural network on the internal activations of the first in order to extract recurring patterns. The result was millions of so-called "features": units of internal representation that each correspond to a specific concept. It is as if biologists, instead of only observing an organism's behavior, had finally found the genes behind specific traits.
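To make the idea less abstract, here is a toy sketch of dictionary learning in the form of a sparse autoencoder, one common way to set it up. It is not Anthropic's code: the sizes, the weights, and the names `encode`, `decode`, and `loss` are all made up for illustration. An activation vector is encoded into a few non-negative feature activations, then rebuilt as a weighted sum of "dictionary" directions. Training (not shown) would adjust the weights to keep the reconstruction accurate while keeping most features at zero.

```python
# Toy sketch: approximate an activation vector as a sparse, non-negative
# combination of "feature" directions. All weights and sizes are invented.

def relu(values):
    return [max(0.0, v) for v in values]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def encode(activation, w_enc, b_enc):
    """Map a model activation to non-negative feature activations."""
    pre = matvec(w_enc, activation)
    return relu([p + b for p, b in zip(pre, b_enc)])

def decode(features, dictionary):
    """Rebuild the activation as a weighted sum of dictionary directions."""
    dim = len(dictionary[0])
    out = [0.0] * dim
    for strength, direction in zip(features, dictionary):
        for i in range(dim):
            out[i] += strength * direction[i]
    return out

def loss(activation, features, reconstruction, l1_coeff=0.1):
    """Reconstruction error plus an L1 penalty that rewards sparse features."""
    mse = sum((a - r) ** 2 for a, r in zip(activation, reconstruction)) / len(activation)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# Hypothetical 2-dimensional activation space with 3 dictionary features.
DICTIONARY = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_ENC = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
B_ENC = [0.0, 0.0, -1.0]

activation = [2.0, 0.0]
features = encode(activation, W_ENC, B_ENC)     # only the first feature fires
reconstruction = decode(features, DICTIONARY)   # rebuilt from that one feature
```

In this toy case a single feature explains the whole activation. The real study works with far larger vectors and millions of features, but the trade-off it balances is the same: faithful reconstruction against sparsity.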
The most amusing and vivid example was the Golden Gate Bridge experiment. Researchers found a feature, a pattern of activity across many neurons, that activates whenever this landmark is mentioned. When they artificially amplified it, Claude became fixated on the bridge. Whatever the question, from cake recipes to existential problems, it would answer through the lens of the Golden Gate. It looked comical, but behind the joke lies a fundamental result: we have learned to manipulate the model's internal representations directly, without retraining it. We found control levers whose existence we had only suspected.
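The "lever" can be pictured with a few lines of arithmetic. The sketch below is a simplification, not the procedure from the study: it assumes a feature is a single direction in activation space and that amplifying it means pushing the activation along that direction until its projection reaches a chosen value. The function name `steer` and the numbers are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(activation, direction, target):
    """Shift the activation along a feature direction so that its
    projection onto the (normalized) direction equals `target`."""
    norm = math.sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    current = dot(activation, unit)
    return [a + (target - current) * u for a, u in zip(activation, unit)]

# Hypothetical 3-dimensional activation and a made-up "bridge" direction.
bridge_direction = [0.0, 3.0, 4.0]
activation = [1.0, 0.0, 0.0]   # the feature is currently silent
steered = steer(activation, bridge_direction, target=10.0)
```

Everything orthogonal to the feature direction is left untouched, which is why the model's weights and training stay exactly as they were: only the activation at inference time changes.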
Anthropic's work is not just fun with bridges, however. The researchers also found far more serious patterns: features associated with biological weapons, malicious code, lying, and even flattering the user. This discovery could change the rules of the game in AI safety. Instead of relying only on retraining the model with endless prohibitions and filters, which can be bypassed, we may gain the ability to monitor its "intentions" in real time. If a "create virus" light turns on during response generation, the system could be stopped before it outputs even the first token.
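What such a "warning light" might look like in practice can be sketched as a simple loop. This is a hypothetical illustration, not a description of any deployed system: the feature names, the threshold, and the per-step activation values are invented, and a real monitor would read activations from the model rather than from a prepared list.

```python
def generate_with_monitor(steps, watched, threshold):
    """Emit tokens one by one, halting before any token whose generation
    step shows a watched feature at or above the threshold.

    `steps` is a sequence of (token, feature_activations) pairs, where
    feature_activations maps a feature name to its activation strength.
    Returns the tokens emitted so far and the features that tripped.
    """
    emitted = []
    for token, feature_activations in steps:
        tripped = [name for name in watched
                   if feature_activations.get(name, 0.0) >= threshold]
        if tripped:
            return emitted, tripped   # stop before this token is output
        emitted.append(token)
    return emitted, []

# Hypothetical trace of a response being generated.
trace = [
    ("Sure", {"malicious_code": 0.1}),
    (",", {"malicious_code": 0.2}),
    ("here", {"malicious_code": 0.9}),   # the "light" turns on here
    ("is", {"malicious_code": 0.95}),
]
tokens, alarms = generate_with_monitor(trace, watched=["malicious_code"], threshold=0.8)
```

The check runs before each token is appended, so in this trace generation halts after "Sure" and "," and the third token never reaches the user.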
Why is this important right now? The AI industry is at a crossroads. On one hand, models are becoming more powerful; on the other, fear of uncontrolled artificial intelligence is pushing regulators to tighten the screws. Anthropic's work gives hope that we can build transparent AI. If we understand a model's internal logic, we can trust it with complex tasks. This is the path from blind faith in an algorithm to engineering precision. We are moving from the age of alchemy, where we simply mixed data and hoped for gold, to the age of chemistry, where every reaction is calculated and understood.
Of course, full transparency is still far off. Claude 3 Sonnet is a mid-sized model, and interpreting its larger sibling Opus or upcoming next-generation models will require colossal computing power. Nevertheless, Anthropic has shown that the "black box" can be opened. This is no longer a question of possibility, but of resources and time. Now that we have seen the internal architecture of a neural network's thoughts, there is no going back to simply contemplating the result. We are beginning to understand how silicon minds think, and this understanding may be our best insurance against science fiction scenarios.
The bottom line: Anthropic is turning AI from an unpredictable oracle into an understandable tool. Will other players, like OpenAI and Google, be able to make their models as transparent, or will they prefer to keep the magic hidden?