Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

Atticus Geiger, Duligur Ibeling, Amir Zur et al.

2023 · arXiv (Cornell University) · 10 citations

Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known but opaque low-level details of black-box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible yet precise formalization of the core concepts of polysemantic neurons, the linear representa…
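
The distinction in contribution (1) between mechanism replacement and mechanism transformation can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy model (h = x + y, out = 2 * h), the dictionary-of-functions representation, and the helper names run, hard_intervention, soft_intervention, and transform_mechanism are all hypothetical. A hard intervention replaces a mechanism with a constant, a soft intervention replaces it with a different function, and a transformation builds the new mechanism from the old one.

```python
from typing import Callable, Dict, List

# A mechanism computes one variable's value from the values assigned so far.
Values = Dict[str, float]
Mechanism = Callable[[Values], float]

def run(mechs: Dict[str, Mechanism], order: List[str], inputs: Values) -> Values:
    """Evaluate each mechanism in topological order, starting from the inputs."""
    values = dict(inputs)
    for name in order:
        values[name] = mechs[name](values)
    return values

# Hypothetical toy causal model: h = x + y, out = 2 * h.
ORDER = ["h", "out"]
base: Dict[str, Mechanism] = {
    "h": lambda v: v["x"] + v["y"],
    "out": lambda v: 2 * v["h"],
}

def hard_intervention(mechs, name: str, constant: float):
    """Mechanism replacement: the variable is fixed to a constant."""
    new = dict(mechs)
    new[name] = lambda v: constant
    return new

def soft_intervention(mechs, name: str, new_mech: Mechanism):
    """Mechanism replacement: the variable gets an unrelated new function."""
    new = dict(mechs)
    new[name] = new_mech
    return new

def transform_mechanism(mechs, name: str,
                        functional: Callable[[Mechanism], Mechanism]):
    """Mechanism transformation: the new mechanism is computed from the old one."""
    new = dict(mechs)
    new[name] = functional(mechs[name])
    return new

inputs = {"x": 1.0, "y": 2.0}
base_out = run(base, ORDER, inputs)["out"]
hard_out = run(hard_intervention(base, "h", 10.0), ORDER, inputs)["out"]
soft_out = run(
    soft_intervention(base, "h", lambda v: v["x"] - v["y"]), ORDER, inputs
)["out"]
# The functional wraps the old mechanism and shifts its output by 1.
shifted = transform_mechanism(base, "h", lambda old: (lambda v: old(v) + 1.0))
trans_out = run(shifted, ORDER, inputs)["out"]

print(base_out, hard_out, soft_out, trans_out)  # 6.0 20.0 -2.0 8.0
```

In this sketch both kinds of replacement are special cases of transformation: a functional that ignores its argument and returns a fixed function recovers a hard or soft intervention, while a functional that uses the old mechanism, as the shift does, cannot be written without it.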
