Mechanistic interpretability — citation graph of key research
Explore the most influential research on mechanistic interpretability as an interactive citation graph on Constellation. The papers below are connected by direct citations and shared references — open any one to center the graph on it and discover related work.
Top papers on mechanistic interpretability
- Progress measures for grokking via mechanistic interpretability
  Neel Nanda, Lawrence Chan, Tom Lieberum et al. · 2023 · arXiv (Cornell University) · 54 citations
- Mechanistic Interpretability for AI Safety -- A Review
  Leonard Bereska, Efstratios Gavves · 2024 · arXiv (Cornell University) · 26 citations
- Towards Automated Circuit Discovery for Mechanistic Interpretability
  Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch et al. · 2023 · arXiv (Cornell University) · 32 citations
- Explaining AI through mechanistic interpretability
  Lena Kästner, Barnaby Crook · 2024 · European Journal for Philosophy of Science · 28 citations
- Seeing Is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
  Ziming Liu, Eric Gan, Max Tegmark · 2023 · Entropy · 21 citations
- Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability
  Atticus Geiger, Duligur Ibeling, Amir Zur et al. · 2023 · arXiv (Cornell University) · 10 citations
- Modeling bioconcentration factor (BCF) using mechanistically interpretable descriptors computed from open source tool “PaDEL-Descriptor”
  Subrata Pramanik, Kunal Roy · 2013 · Environmental Science and Pollution Research · 27 citations
- Survey on the Role of Mechanistic Interpretability in Generative AI
  Leonardo Ranaldi · 2025 · Big Data and Cognitive Computing · 10 citations
- From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models
  E. Charles Adams, Liam Bai, Minji Lee et al. · 2025 · bioRxiv (Cold Spring Harbor Laboratory) · 17 citations
- A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
  Daking Rai, Yilun Zhou, Shi Feng et al. · 2024 · arXiv (Cornell University) · 10 citations
- Open Problems in Mechanistic Interpretability
  Lee Sharkey, Bilal Chughtai, Joshua Batson et al. · 2025 · ArXiv.org · 5 citations
- Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
  Vedant Palit, Rohan Pandey, Aryaman Arora et al. · 2023 · 9 citations