Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Li Yuan, Yunpeng Chen, Tao Wang et al.

2021 · 2021 IEEE/CVF International Conference on Computer Vision (ICCV) · 2,239 citations

Transformers, which are popular for language modeling, have been explored for solving vision tasks recently, e.g., the Vision Transformer (ViT) for image classification. The ViT model splits each image into a sequence of tokens with fixed length and then applies multiple Transformer layers to model their global relation for classification. However, ViT achieves inferior performance to CNNs when trained from scratch on a midsize dataset like ImageNet. We find it is because: 1) the simple tokenization of input images fails to model the important local structure such as edges and lines among nei…
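As a rough illustration of the fixed-length tokenization the abstract attributes to ViT, the sketch below cuts an image into non-overlapping square patches and flattens each one into a token. It is a minimal sketch and not the authors' code: the function name `patchify`, the nested-list image layout, and the toy 4x4 image with 2x2 patches are all assumptions made for illustration, and the learned linear projection that a real ViT applies to each patch is omitted.

```python
def patchify(image, patch_size):
    """Split an image into non-overlapping patches, one flat token per patch.

    `image` is a nested list indexed as image[row][col][channel]
    (a hypothetical layout chosen for this sketch).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    tokens = []
    # Walk the patch grid row by row, left to right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    token.extend(image[row][col])  # append every channel value
            tokens.append(token)
    return tokens


# Toy single-channel 4x4 image whose pixel value is row * 4 + col.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)

print(len(tokens), len(tokens[0]))  # 4 4
print(tokens[0])                    # [0, 1, 4, 5]
print(tokens[1])                    # [2, 3, 6, 7]
```

In this toy case pixels 1 and 2 are neighbours in the image but end up in different tokens, which is one way to picture why a hard patch split does not by itself capture local structure that crosses patch boundaries.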
