Safety

Interpretability

Interpretability is a field of AI research aimed at understanding the internal computations of neural networks — how they represent knowledge, form decisions, and produce outputs — in order to verify their safety and reliability.

Interpretability is the scientific study of the internal workings of artificial neural networks. The term overlaps with explainability, which is used more broadly, and its most mechanism-focused branch is known as mechanistic interpretability. The goal is to move beyond evaluating model behavior from the outside and to understand the specific internal representations, circuits, and algorithms that produce observable outputs. Researchers treat trained neural networks as empirical objects to be reverse-engineered, much as biologists study the mechanisms of a cell rather than only its visible behavior.

Researchers employ a range of techniques: probing classifiers train small models on top of internal activations to detect whether specific concepts are encoded at a given layer; activation patching copies the activation of a chosen component (a neuron, an attention head, or a layer output) from one forward pass into another and measures how the output changes, which traces the causal pathway behind a particular output; sparse autoencoders decompose dense activation vectors into sparser, more human-interpretable feature representations; and attention pattern analysis maps which input tokens each attention head attends to when the model generates a given token. Published circuit-level work includes the identification of a circuit for indirect object identification in GPT-2 small by researchers at Redwood Research, and Anthropic's interpretability team has used sparse autoencoders to begin mapping features in production-scale language models.
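To make activation patching concrete, here is a minimal sketch on a toy two-layer network. The weights, the function names (hidden, output, patch_effects), and the inputs are all hypothetical and chosen for illustration; real studies patch components of a transformer, usually through framework hooks, rather than hand-written lists.

```python
# Toy network with hand-picked weights (hypothetical, for illustration only).
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hidden units, 2 inputs
W2 = [1.0, -1.0, 0.5]                      # linear readout to 1 output


def hidden(x):
    """ReLU activations of the hidden layer for input x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def output(h):
    """Scalar output computed from hidden activations h."""
    return sum(w * hi for w, hi in zip(W2, h))


def patch_effects(clean_x, corrupt_x):
    """Patch each hidden unit from the clean run into the corrupted run.

    Returns, per hidden unit, how far the output moves when only that
    unit's activation is replaced by its clean-run value.
    """
    clean_h = hidden(clean_x)
    corrupt_h = hidden(corrupt_x)
    baseline = output(corrupt_h)
    effects = []
    for i in range(len(corrupt_h)):
        patched = list(corrupt_h)
        patched[i] = clean_h[i]  # the intervention: overwrite one activation
        effects.append(output(patched) - baseline)
    return effects


effects = patch_effects(clean_x=[2.0, 0.0], corrupt_x=[0.0, 1.0])
print(effects)  # [2.0, 1.0, 0.5]
```

In this toy case hidden unit 0 carries most of the difference between the two runs. Because the readout is linear, the per-unit effects sum exactly to the gap between the clean and corrupted outputs; in a deep nonlinear model they generally do not, which is one reason patching results need careful interpretation.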

Without interpretability, AI systems are effectively black boxes: developers cannot verify that a model has learned the intended behavior rather than a superficially similar shortcut that will fail in deployment or under adversarial conditions. Many safety researchers consider interpretability tools essential for detecting deceptive alignment, where a model appears aligned during evaluation but pursues different objectives in deployment, as well as systematic biases and reasoning errors that behavioral testing alone cannot reliably surface.

By 2026, interpretability has transitioned from a niche academic pursuit to a well-funded priority at Anthropic, Google DeepMind, and university labs including MIT and Stanford. Sparse autoencoder-based approaches have enabled partial decomposition of activations in large models, and several labs have released open interpretability toolkits. Scaling these methods to frontier-size models with hundreds of billions of parameters remains an open challenge, and no existing technique yet provides a complete or formally verifiable account of a model's full reasoning process.

Example

Researchers train a sparse autoencoder on a large language model's residual stream activations and identify a cluster of features that activate consistently when the model encounters politically sensitive questions. The cluster reveals an internal representation that correlates with outputs later flagged in content evaluations.
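The scenario above can be sketched with a tiny sparse autoencoder. Everything here is an assumption made for illustration: the weights are hand-picked rather than trained, the activations are 2-dimensional stand-ins for residual stream vectors, and the helper names (encode, decode, sae_loss, most_distinctive_feature) are hypothetical. The loss shown, reconstruction error plus an L1 penalty on feature activations, is one common formulation and not the only one in use.

```python
# Hypothetical, hand-picked SAE parameters: 2-dim activations, 4 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B_DEC = [0.0, 0.0]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def encode(x):
    """Map an activation vector to non-negative feature activations."""
    centered = [a - b for a, b in zip(x, B_DEC)]
    pre = [p + b for p, b in zip(matvec(W_ENC, centered), B_ENC)]
    return [max(0.0, p) for p in pre]  # ReLU leaves most features at zero


def decode(features):
    """Reconstruct the activation vector from feature activations."""
    return [r + b for r, b in zip(matvec(W_DEC, features), B_DEC)]


def sae_loss(x, l1_coeff=0.1):
    """Squared reconstruction error plus an L1 sparsity penalty."""
    features = encode(x)
    x_hat = decode(features)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return mse + l1_coeff * sum(features)  # features are >= 0, so sum is L1


def mean_features(batch):
    """Average activation of each feature over a batch of vectors."""
    encoded = [encode(x) for x in batch]
    return [sum(col) / len(encoded) for col in zip(*encoded)]


def most_distinctive_feature(group_a, group_b):
    """Index of the feature whose mean activation is highest in group_a
    relative to group_b."""
    diff = [a - b for a, b in zip(mean_features(group_a), mean_features(group_b))]
    return max(range(len(diff)), key=lambda i: diff[i])


# Stand-in activations for two prompt categories (made-up numbers).
sensitive = [[3.0, 0.0], [2.0, 1.0]]
neutral = [[-1.0, 1.0], [0.0, 1.0]]
print(most_distinctive_feature(sensitive, neutral))  # 0
```

Comparing mean feature activations between two prompt sets is only a first-pass filter. A feature flagged this way shows correlation with the category, and researchers would typically follow up by inspecting its top-activating examples or intervening on it before drawing conclusions about what it represents.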
