Adaptive Split-Fusion Transformer
Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review
Detail(s)
Original language | English |
---|---|
Title of host publication | Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023 |
Publisher | Institute of Electrical and Electronics Engineers, Inc. |
Pages | 1169-1174 |
ISBN (electronic) | 978-1-6654-6891-6 |
ISBN (print) | 978-1-6654-6892-3 |
Publication status | Published - 2023 |
Publication series
Name | Proceedings - IEEE International Conference on Multimedia and Expo |
---|---|
ISSN (Print) | 1945-7871 |
ISSN (electronic) | 1945-788X |
Conference
Title | 2023 IEEE International Conference on Multimedia and Expo (ICME 2023) |
---|---|
Location | Brisbane Convention and Exhibition Centre |
Place | Australia |
City | Brisbane |
Period | 10 - 14 July 2023 |
Abstract
Neural networks for visual content understanding have recently evolved from convolutional networks to transformers. The former (CNN) relies on small-windowed kernels to capture regional clues, demonstrating solid local expressiveness. In contrast, the latter (transformer) establishes long-range global connections between localities for holistic learning. Inspired by this complementary nature, there is growing interest in designing hybrid models that utilize both techniques. Current hybrids merely replace convolutions with simple approximations of linear projection, or juxtapose a convolution branch with attention, without considering the importance of local/global modeling. To tackle this, we propose a new hybrid named Adaptive Split-Fusion Transformer (ASF-former) that treats its convolutional and attention branches differently, with adaptive weights. Specifically, an ASF-former encoder splits the feature channels equally in half to fit the dual-path inputs. The outputs of the dual paths are then fused with weights calculated from visual cues. Out of efficiency concerns, we also design a compact convolutional path. Extensive experiments on standard benchmarks show that our ASF-former outperforms its CNN, transformer, and hybrid counterparts in accuracy (83.9% on ImageNet-1K) under similar conditions (12.9G MACs / 56.7M Params, without large-scale pre-training). The code is available at: https://github.com/szx503045266/ASF-former. © 2023 IEEE.
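The split-and-fuse step described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the channel dimension is split in half, each half is fed through one path (stand-ins for the convolutional and attention branches, passed in as callables), and the two outputs are recombined with adaptive weights computed from a visual cue, here assumed to be a learned linear gate over globally pooled input features. The gate matrix `w_gate` and the exact fusion rule are illustrative assumptions; the paper's compact convolutional path and fusion weighting may differ in detail.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def split_fusion_encoder(x, w_gate, conv_path, attn_path):
    """Sketch of one ASF-former-style split-fusion step.

    x         : (N, C) token features
    w_gate    : (2, C) hypothetical gating matrix producing two branch weights
    conv_path : callable on the first  C/2 channels (local branch stand-in)
    attn_path : callable on the second C/2 channels (global branch stand-in)
    """
    n, c = x.shape
    # 1. Split channels equally in half to fit the dual-path inputs.
    x_conv, x_attn = x[:, : c // 2], x[:, c // 2 :]
    y_conv = conv_path(x_conv)   # (N, C/2) local features
    y_attn = attn_path(x_attn)   # (N, C/2) global features
    # 2. Adaptive weights from a visual cue (global average pooling here).
    pooled = x.mean(axis=0)                # (C,)
    weights = softmax(w_gate @ pooled)     # (2,) branch weights, sum to 1
    # 3. Fuse: reweight each branch and restore the full channel width.
    return np.concatenate([weights[0] * y_conv,
                           weights[1] * y_attn], axis=1)  # (N, C)
```

With identity functions standing in for both branches, the output keeps the input's `(N, C)` shape, so such encoders can be stacked like ordinary transformer blocks.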
Research Area(s)
- CNN, gating, hybrid, transformer, Visual understanding
Citation Format(s)
Adaptive Split-Fusion Transformer. / Su, Zixuan; Chen, Jingjing; Pang, Lei et al.
Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023. Institute of Electrical and Electronics Engineers, Inc., 2023. p. 1169-1174 (Proceedings - IEEE International Conference on Multimedia and Expo).