Mechanistic interpretability — citation graph of key research

Explore the most influential research on mechanistic interpretability as an interactive citation graph on Constellation. The papers below are connected by direct citations and shared references — open any one to center the graph on it and discover related work.

Top papers on mechanistic interpretability

Progress measures for grokking via mechanistic interpretability
Neel Nanda, Lawrence Chan, Tom Lieberum et al. — 2023 · arXiv (Cornell University) · 54 citations
Explaining AI through mechanistic interpretability
Lena Kästner, Barnaby Crook — 2024 · European Journal for Philosophy of Science · 35 citations
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska, Efstratios Gavves — 2024 · arXiv (Cornell University) · 27 citations
Towards Automated Circuit Discovery for Mechanistic Interpretability
Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch et al. — 2023 · arXiv (Cornell University) · 33 citations
Toward a mechanistic psychology of dialogue
Martin J. Pickering, Simon Garrod — 2004 · Behavioral and Brain Sciences · 2,654 citations
Seeing Is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability
Ziming Liu, Eric Gan, Max Tegmark — 2023 · Entropy · 22 citations
Survey on the Role of Mechanistic Interpretability in Generative AI
Leonardo Ranaldi — 2025 · Big Data and Cognitive Computing · 12 citations
EMT, CSCs, and drug resistance: the mechanistic link and clinical implications
Tsukasa Shibue, Robert A. Weinberg — 2017 · Nature Reviews Clinical Oncology · 2,579 citations
From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models
E. Charles Adams, Liam Bai, Minji Lee et al. — 2025 · bioRxiv (Cold Spring Harbor Laboratory) · 19 citations
Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability
Atticus Geiger, Duligur Ibeling, Amir Zur et al. — 2023 · arXiv (Cornell University) · 10 citations
Modeling bioconcentration factor (BCF) using mechanistically interpretable descriptors computed from open source tool “PaDEL-Descriptor”
Subrata Pramanik, Kunal Roy — 2013 · Environmental Science and Pollution Research · 27 citations
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models
Daking Rai, Yilun Zhou, Shi Feng et al. — 2024 · arXiv (Cornell University) · 11 citations

Constellation — Explore research as a citation graph

Constellation maps the world's scientific literature as an interactive graph. Search any research topic or paste a paper's DOI or arXiv link, and see how works connect through citations and shared references. Discover the sub-fields of an area, tell foundational work from the frontier, follow citation trails, and generate a synthesis of the landscape — across every discipline, powered by OpenAlex.
