Demystifying AI Interpretability
- Date: Mar 28, 2025
- Time: 03:00 PM - 04:15 PM (Local Time Germany)
- Speaker: Federico Adolfi
- Host: MaxPlanckLaw
- Region: Online
- Topic: Discussion and debate formats, lectures

Throughout, we will use a particular lens to demystify what AI interpretability is, and which goals are within or out of its reach: instead of focusing on the promises of (algorithmic) solutions for interpretability, we will focus on the properties of the (computational) problems they attempt to solve. This lens—which we call computational meta-theory—will allow us to put stakeholders’ goals at the centre and to reason about the adequacy of interpretability ‘hammers’ to hit practically meaningful ‘nails’.
Federico Adolfi is currently a postdoctoral researcher at the Ernst Strüngmann Institute for Neuroscience, Max Planck Society. He combines a background in cognitive and brain science, computer science, and music. His PhD in Computational Cognitive Science at the University of Bristol focused on establishing a conceptual and formal framework for computational meta-theory and on demonstrating its application to problems in psychology, neuroscience, and artificial intelligence. One such application is the problem of AI interpretability, for which he and his colleagues recently provided the first formal analyses of the scope and limits of circuit discovery as a method for interpreting neural networks.