A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models

Daking Rai, Yilun Zhou, Shi Feng, et al.

2024 · arXiv (Cornell University) · 11 citations

Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to understand a neural network model by reverse-engineering its internal computations. Recently, MI has garnered significant attention for interpreting transformer-based language models (LMs), resulting in many novel insights yet introducing new challenges. However, there has not been work that comprehensively reviews these insights and challenges, particularly as a guide for newcomers to this field. To fill this gap, we provide a comprehensive survey from a task-centric perspective, organizing the taxon…
