Adaptive Split-Fusion Transformer

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

1 Scopus Citations

Detail(s)

Original language: English
Title of host publication: Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Publisher: Institute of Electrical and Electronics Engineers, Inc.
Pages: 1169-1174
ISBN (electronic): 9781665468916
ISBN (print): 978-1-6654-6892-3
Publication status: Published - 2023

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (print): 1945-7871
ISSN (electronic): 1945-788X

Conference

Title: 2023 IEEE International Conference on Multimedia and Expo (ICME 2023)
Location: Brisbane Convention and Exhibition Centre
Place: Australia
City: Brisbane
Period: 10 - 14 July 2023

Abstract

Neural networks for visual content understanding have recently evolved from convolutional ones to transformers. The former (CNN) relies on small-windowed kernels to capture regional clues, demonstrating solid local expressiveness. In contrast, the latter (transformer) establishes long-range global connections between localities for holistic learning. Inspired by this complementary nature, there is a growing interest in designing hybrid models that utilize both techniques. Current hybrids merely use convolutions as simple approximations of linear projections or juxtapose a convolution branch with attention without considering the importance of local/global modeling. To tackle this, we propose a new hybrid named Adaptive Split-Fusion Transformer (ASF-former) that treats convolutional and attention branches differently with adaptive weights. Specifically, an ASF-former encoder splits feature channels equally in half to fit dual-path inputs. Then, the outputs of the dual paths are fused with weights calculated from visual cues. We also design a compact convolutional path out of concern for efficiency. Extensive experiments on standard benchmarks show that our ASF-former outperforms its CNN, transformer, and hybrid counterparts in terms of accuracy (83.9% on ImageNet-1K), under similar conditions (12.9G MACs / 56.7M Params, without large-scale pre-training). The code is available at: https://github.com/szx503045266/ASF-former. © 2023 IEEE.
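The split-and-fuse mechanism described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under simplifying assumptions: the local path is stood in for by a neighborhood average rather than the paper's compact convolutional branch, attention is single-head with a shared projection, and the gating layer shape is invented for the example; consult the released code for the actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asf_block(x, w_attn, w_gate):
    """Sketch of one split-fusion encoder step.

    x      : (N, C) token features; channels are split in half.
    w_attn : (C//2, C//2) shared Q/K/V projection (illustrative).
    w_gate : (C, 2) gating weights (illustrative).
    """
    n, c = x.shape
    h = c // 2
    x_conv, x_attn = x[:, :h], x[:, h:]

    # Local path: 1-D neighborhood average as a stand-in for the
    # compact convolutional branch (assumption, not the paper's design).
    y_conv = (np.roll(x_conv, 1, 0) + x_conv + np.roll(x_conv, -1, 0)) / 3.0

    # Global path: plain single-head self-attention over tokens.
    q = k = v = x_attn @ w_attn
    a = softmax(q @ k.T / np.sqrt(h))
    y_attn = a @ v

    # Adaptive fusion: gate weights computed from a pooled visual cue.
    cue = np.concatenate([y_conv, y_attn], axis=1).mean(axis=0)   # (C,)
    g = 1.0 / (1.0 + np.exp(-(cue @ w_gate)))                     # 2 scalars in (0, 1)
    return np.concatenate([g[0] * y_conv, g[1] * y_attn], axis=1)
```

The key point the sketch captures is that the two branch outputs are not summed with fixed coefficients: the gate `g` is recomputed from the features themselves, so the balance between local and global modeling adapts per input.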

Research Area(s)

  • CNN, gating, hybrid, transformer, Visual understanding

Citation Format(s)

Adaptive Split-Fusion Transformer. / Su, Zixuan; Chen, Jingjing; Pang, Lei et al.
Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023. Institute of Electrical and Electronics Engineers, Inc., 2023. p. 1169-1174 (Proceedings - IEEE International Conference on Multimedia and Expo).
