Revisiting the Integration of Convolution and Attention for Vision Backbone

Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson Lau*

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

3 Citations (Scopus)

Abstract

Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per-pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel at different granularity levels instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the grid for local features, and MHSAs to the slots for global features. A pair of fully differentiable soft clustering and dispatching modules is introduced to bridge the grid and set representations, thus enabling local-global fusion. Through extensive experiments on various vision tasks, we empirically verify the potential of the proposed integration scheme, named GLMix: by offloading the burden of fine-grained features to lightweight Convs, it is sufficient to use MHSAs in a few (e.g., 64) semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient. Our visualization results also demonstrate that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may induce better interpretability and inspire new weakly-supervised semantic segmentation approaches. Code will be available at https://github.com/rayleizhu/GLMix. © 2024 Neural information processing systems foundation. All rights reserved.
Original language: English
Title of host publication: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Editors: A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, C. Zhang
Publisher: Neural Information Processing Systems (NeurIPS)
Pages: 42941-42964
ISBN (Electronic): 9798331314385
Publication status: Published - Dec 2024
Event: 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024) - Vancouver Convention Center, Vancouver, Canada
Duration: 10 Dec 2024 - 15 Dec 2024
https://neurips.cc/
https://proceedings.neurips.cc/

Publication series

Name: Advances in Neural Information Processing Systems
Volume: 37
ISSN (Print): 1049-5258

Conference

Conference: 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
Abbreviated title: NeurIPS 2024
Place: Canada
City: Vancouver
Period: 10/12/24 - 15/12/24

Bibliographical note

Full text of this publication does not contain sufficient affiliation information. With consent from the author(s) concerned, the Research Unit(s) information for this record is based on the existing academic department affiliation of the author(s).
