Seeing Is Believing: Brain-Inspired Modular Training for Mechanistic Interpretability

Ziming Liu, Eric Gan, Max Tegmark

2023 · Entropy · 22 citations

We introduce Brain-Inspired Modular Training (BIMT), a method for making neural networks more modular and interpretable. Inspired by brains, BIMT embeds neurons in a geometric space and augments the loss function with a cost proportional to the length of each neuron connection. The approach draws on the idea of minimum connection cost in evolutionary biology, but we are the first to combine this idea with training neural networks with gradient descent for interpretability. We demonstrate that BIMT discovers useful modular neural networks for many simple tasks, revealing compositional structure…
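The length-weighted connection cost described above can be illustrated with a minimal sketch. It is not the paper's implementation: the evenly spaced 2D neuron layout, the Euclidean distance, the weighting by |w|, and the names `neuron_positions`, `connection_cost`, `augmented_loss`, and `lam` are all assumptions made for illustration.

```python
import numpy as np

def neuron_positions(n, y):
    """Place n neurons evenly on [0, 1] at height y (assumed layout, not from the paper)."""
    x = np.linspace(0.0, 1.0, n) if n > 1 else np.array([0.5])
    return np.stack([x, np.full(n, y)], axis=1)  # shape (n, 2)

def connection_cost(weight, pos_in, pos_out):
    """Sum over all connections of |w_ij| times the connection's length.

    weight has shape (n_out, n_in); Euclidean length is an assumption here.
    """
    diff = pos_out[:, None, :] - pos_in[None, :, :]   # (n_out, n_in, 2)
    length = np.linalg.norm(diff, axis=-1)            # (n_out, n_in)
    return float(np.sum(np.abs(weight) * length))

def augmented_loss(task_loss, weights, positions, lam=1e-3):
    """Task loss plus lam times the total connection cost over all layers.

    positions[k] holds the coordinates of the neurons feeding weights[k];
    positions[k + 1] holds the coordinates of the neurons it feeds into.
    """
    penalty = sum(
        connection_cost(W, positions[k], positions[k + 1])
        for k, W in enumerate(weights)
    )
    return task_loss + lam * penalty

# Two layers of two neurons each, one unit apart vertically.
pos_in = neuron_positions(2, 0.0)
pos_out = neuron_positions(2, 1.0)
local_wiring = np.eye(2)                # each neuron connects straight up
crossed_wiring = np.array([[0.0, 1.0],
                           [1.0, 0.0]])  # each neuron connects diagonally
print(connection_cost(local_wiring, pos_in, pos_out))
print(connection_cost(crossed_wiring, pos_in, pos_out))
```

Under these assumptions the crossed wiring costs more than the local wiring for the same weight magnitudes, so gradient descent on the augmented loss is nudged toward short, local connections.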
