CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Chun-Fu Richard Chen, Quanfu Fan, Rameswar Panda

2021 · 2021 IEEE/CVF International Conference on Computer Vision (ICCV) · 1,927 citations

The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer to combine image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity, and these tokens are then fu…
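As a rough illustration of the small-patch and large-patch tokens mentioned in the abstract, the sketch below splits one image into non-overlapping patches at two sizes. The image size (224×224), the patch sizes (8 and 16), and the helper name `patchify` are assumptions chosen for illustration, not values or code taken from the paper. It shows only the tokenization step, not the two transformer branches or how their tokens are combined.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch * patch * C), one row per token.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

# Assumed sizes, for illustration only.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
small_tokens = patchify(image, 8)    # many short tokens
large_tokens = patchify(image, 16)   # fewer, longer tokens

print(small_tokens.shape, large_tokens.shape)
```

Under these assumed sizes, the small-patch view yields four times as many tokens as the large-patch view. Because self-attention cost grows quadratically with the number of tokens, the two views carry very different compute budgets.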
