Mechanistic interpretability — citation graph of key research

Explore the most influential research on mechanistic interpretability as an interactive citation graph on Constellation. The papers below are connected by direct citations and shared references — open any one to center the graph on it and discover related work.

Top papers on mechanistic interpretability

  1. Progress measures for grokking via mechanistic interpretability
    Neel Nanda, Lawrence Chan, Tom Lieberum et al. — 2023 · arXiv (Cornell University) · 54 citations
  2. Mechanistic Interpretability for AI Safety -- A Review
    Leonard Bereska, Efstratios Gavves — 2024 · arXiv (Cornell University) · 26 citations
  3. Towards Automated Circuit Discovery for Mechanistic Interpretability
    Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch et al. — 2023 · arXiv (Cornell University) · 32 citations
  4. Explaining AI through mechanistic interpretability
    Lena Kästner, Barnaby Crook — 2024 · European Journal for Philosophy of Science · 28 citations
  5. Seeing Is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
    Ziming Liu, Eric Gan, Max Tegmark — 2023 · Entropy · 21 citations
  6. Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability
    Atticus Geiger, Duligur Ibeling, Amir Zur et al. — 2023 · arXiv (Cornell University) · 10 citations
  7. Modeling bioconcentration factor (BCF) using mechanistically interpretable descriptors computed from open source tool “PaDEL-Descriptor”
    Subrata Pramanik, Kunal Roy — 2013 · Environmental Science and Pollution Research · 27 citations
  8. Survey on the Role of Mechanistic Interpretability in Generative AI
    Leonardo Ranaldi — 2025 · Big Data and Cognitive Computing · 10 citations
  9. From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models
    Etowah Adams, Liam Bai, Minji Lee et al. — 2025 · bioRxiv (Cold Spring Harbor Laboratory) · 17 citations
  10. A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
    Daking Rai, Yilun Zhou, Shi Feng et al. — 2024 · arXiv (Cornell University) · 10 citations
  11. Open Problems in Mechanistic Interpretability
    Lee Sharkey, Bilal Chughtai, Joshua Batson et al. — 2025 · arXiv (Cornell University) · 5 citations
  12. Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
    Vedant Palit, Rohan Pandey, Aryaman Arora et al. — 2023 · 9 citations