An Empirical Study of Training Self-Supervised Vision Transformers

Xinlei Chen, Saining Xie, Kaiming He

2021 · 2021 IEEE/CVF International Conference on Computer Vision (ICCV) · 1,439 citations

This paper does not describe a novel method. Instead, it studies a straightforward, incremental, yet must-know baseline given the recent progress in computer vision: self-supervised learning for Vision Transformers (ViT). While the training recipes for standard convolutional networks have been highly mature and robust, the recipes for ViT are yet to be built, especially in the self-supervised scenarios where training becomes more challenging. In this work, we go back to basics and investigate the effects of several fundamental components for training self-supervised ViT. We observe that insta…

Read the paper →

Explore this paper's citation graph on Constellation.