BiFormer : Vision Transformer with Bi-Level Routing Attention

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-reviewed

128 Scopus Citations

Detail(s)

Original language: English
Title of host publication: Proceedings - 2023 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2023
Publisher: IEEE
Pages: 10323-10333
ISBN (electronic): 979-8-3503-0129-8
ISBN (print): 979-8-3503-0130-4
Publication status: Published - 2023

Publication series

ISSN (print): 1063-6919
ISSN (electronic): 2575-7075

Conference

Title: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)
Location: Vancouver Convention Center
Place: Canada
City: Vancouver
Period: 18 - 22 June 2023

Abstract

As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint, as pairwise token interaction is computed across all spatial locations. A series of works attempt to alleviate this problem by introducing handcrafted, content-agnostic sparsity into attention, such as restricting the attention operation to local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible, content-aware allocation of computation. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of the remaining candidate regions (i.e., routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which exploits the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a query-adaptive manner, without distraction from irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks, such as image classification, object detection, and semantic segmentation, verify the effectiveness of our design. © 2023 IEEE
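The two-stage scheme described in the abstract (coarse region-level routing followed by fine token-level attention) can be sketched as follows. This is a minimal single-head NumPy illustration, not the authors' implementation: identity projections stand in for the learned query/key/value projections, and the function names (`bi_level_routing_attention`, `softmax`) and parameters (`S` regions per side, `topk` routed regions) are illustrative assumptions; the paper's version uses batched gathers and runs as dense GPU matrix multiplications.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bi_level_routing_attention(x, S=2, topk=1):
    """Single-head sketch of bi-level routing attention on an (H, W, C) map.

    The map is partitioned into S*S regions. A region-level affinity graph
    selects the top-k regions each query region routes to; dense token-to-token
    attention is then computed only over the gathered key/value tokens.
    """
    H, W, C = x.shape
    h, w = H // S, W // S                       # tokens per region side
    n_reg, n_tok = S * S, h * w

    # Partition into regions: (n_reg, n_tok, C). Identity projections here;
    # a real model would apply learned Wq, Wk, Wv first.
    regions = (x.reshape(S, h, S, w, C)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(n_reg, n_tok, C))
    q = k = v = regions

    # Coarse level: region-to-region routing via mean-pooled queries/keys.
    qr = q.mean(axis=1)                         # (n_reg, C)
    kr = k.mean(axis=1)
    affinity = qr @ kr.T                        # (n_reg, n_reg)
    routed = np.argsort(-affinity, axis=1)[:, :topk]  # top-k regions per region

    # Fine level: token attention inside the union of routed regions.
    out = np.empty_like(regions)
    for r in range(n_reg):
        kg = k[routed[r]].reshape(-1, C)        # gathered keys   (topk*n_tok, C)
        vg = v[routed[r]].reshape(-1, C)        # gathered values (topk*n_tok, C)
        attn = softmax(q[r] @ kg.T / np.sqrt(C), axis=-1)
        out[r] = attn @ vg

    # Scatter regions back to the (H, W, C) layout.
    return (out.reshape(S, S, h, w, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(H, W, C))
```

With `topk` equal to `S*S`, every region routes to all regions and the scheme degenerates to full attention; the savings come from keeping `topk` small so each query token only ever multiplies against `topk * n_tok` keys.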

Research Area(s)

  • Deep learning architectures and techniques

Bibliographic Note

Research Unit(s) information for this publication is provided by the author(s) concerned.

Citation Format(s)

BiFormer: Vision Transformer with Bi-Level Routing Attention. / Zhu, Lei; Wang, Xinjiang; Ke, Zhanghan et al.
Proceedings - 2023 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2023. IEEE, 2023. p. 10323-10333.
