
@ halalmoney
2025-04-28 15:40:53
Snippets from The Urgency of Interpretability, Dario Amodei
People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work. They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology. For several years, we (both Anthropic and the field at large) have been trying to solve this problem, to create the analogue of a highly precise and accurate MRI that would fully reveal the inner workings of an AI model.
-
As my friend and co-founder Chris Olah is fond of saying, generative AI systems are grown more than they are built—their internal mechanisms are “emergent” rather than directly designed. It’s a bit like growing a plant or a bacterial colony: we set the high-level conditions that direct and shape growth, but the exact structure which emerges is unpredictable and difficult to understand or explain.
-
Our inability to understand models’ internal mechanisms means that we cannot meaningfully predict such behaviors, and therefore struggle to rule them out; indeed, models do exhibit unexpected emergent behaviors, though none that have yet risen to major levels of concern. More subtly, the same opacity makes it hard to find definitive evidence supporting the existence of these risks at a large scale, making it hard to rally support for addressing them—and indeed, hard to know for sure how dangerous they are.
-
The model uses superposition because this allows it to express more concepts than it has neurons, enabling it to learn more. If superposition seems tangled and difficult to understand, that’s because, as ever, the learning and operation of AI models are not optimized in the slightest to be legible to humans.
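A minimal numpy sketch of the idea (my own illustration, not code from the essay or from Anthropic): pack many sparse "features" into far fewer neurons using nearly orthogonal random directions, and observe the small interference that results. The names here (`n_neurons`, `encode`, `readout`) are invented for the example.

```python
# Illustrative sketch of superposition in the toy sense: storing more sparse
# features than there are neurons by giving each feature a random direction
# in neuron space. Not Anthropic's code; all names are placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 64, 512          # far more concepts than dimensions

# Random unit directions are only *nearly* orthogonal, so features interfere slightly.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

def encode(active, magnitudes):
    """Superpose a few active features into a single activation vector."""
    return magnitudes @ directions[active]

def readout(activation, feature):
    """Estimate one feature's magnitude by projecting onto its direction."""
    return activation @ directions[feature]

# A sparse input: only 3 of the 512 features are 'on'.
active = np.array([7, 130, 400])
mags = np.array([1.0, 0.5, 2.0])
act = encode(active, mags)

for f, m in zip(active, mags):
    print(f"feature {f}: true {m:.2f}, recovered {readout(act, f):+.2f}")
# Recovery is close but not exact: the small errors are interference from
# sharing 64 neurons among 512 feature directions.
```

The recovered magnitudes come back slightly off; that residual cross-talk between feature directions is one reason the learned representations look tangled to a human inspector.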
-
…we employed a method called autointerpretability—which uses an AI system itself to analyze interpretability features—to scale the process of not just finding the features, but listing and identifying what they mean in human terms.
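The essay doesn't spell out the mechanics, but the core loop of autointerpretability can be sketched roughly like this (a hypothetical outline: `ask_model`, `label_feature`, and the example feature data are all placeholders of mine, not Anthropic's pipeline):

```python
# Hedged sketch of the *idea*: use a language model to propose a human-readable
# label for each learned feature from its top-activating text snippets.
from typing import List

def ask_model(prompt: str) -> str:
    # Placeholder: in practice this would call whatever LLM API you use.
    return "<label proposed by the explainer model>"

def label_feature(feature_id: int, top_snippets: List[str]) -> str:
    """Ask an explainer model what concept a feature seems to represent."""
    examples = "\n".join(f"- {s}" for s in top_snippets)
    prompt = (
        "These text snippets most strongly activate one internal feature of a "
        "neural network:\n"
        f"{examples}\n"
        "In a short phrase, what concept does this feature appear to represent?"
    )
    return ask_model(prompt)

# Invented example data: each feature's top-activating snippets share a theme.
features = {
    42: ["the Golden Gate Bridge at sunset", "driving across the Golden Gate"],
    99: ["signed the contract yesterday", "breach of contract claim"],
}

for fid, snippets in features.items():
    print(fid, "->", label_feature(fid, snippets))
```

The point is only the shape of the loop, features found at scale, then named in human terms by another model, not the specifics of any particular pipeline.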
-
Our long-run aspiration is to be able to look at a state-of-the-art model and essentially do a “brain scan”: a checkup that has a high probability of identifying a wide range of issues including tendencies to lie or deceive, power-seeking, flaws in jailbreaks, cognitive strengths and weaknesses of the model as a whole, and much more. This would then be used in tandem with the various techniques for training and aligning models, a bit like how a doctor might do an MRI to diagnose a disease, then prescribe a drug to treat it, then do another MRI to see how the treatment is progressing, and so on.
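Read as a workflow, the analogy amounts to roughly the loop below (entirely hypothetical; `interpretability_scan` and `retrain_against` are invented stand-ins for tools that, as the essay stresses, do not fully exist yet):

```python
# Hypothetical outline of the "scan, then treat, then scan again" loop.
def interpretability_scan(model):
    """Stand-in for a future 'brain scan': return a list of detected issues."""
    return []  # e.g. ["deceptive tendency", "power-seeking behavior"]

def retrain_against(model, issues):
    """Stand-in for alignment training targeted at the diagnosed issues."""
    return model

def checkup_loop(model, max_rounds=3):
    for _ in range(max_rounds):
        issues = interpretability_scan(model)   # diagnose (the "MRI")
        if not issues:
            break                               # nothing flagged; stop
        model = retrain_against(model, issues)  # treat, then re-scan next round
    return model
```

Everything of substance lives inside the two stubs, which is exactly the gap the essay argues interpretability research needs to close.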
-
…if the US and China reach powerful AI simultaneously (which is what I expect to happen without export controls), the geopolitical incentives will make any slowdown at all essentially impossible.
https://www.darioamodei.com/post/the-urgency-of-interpretability