DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks
Research output: Conference Papers (RGC: 31A, 31B, 32, 33) › 32_Refereed conference paper (no ISBN/ISSN) › peer-review
Author(s)
Related Research Unit(s)
Detail(s)
| Original language | English |
| --- | --- |
| Publication status | Accepted/In press - Jun 2023 |
Conference
| Title | IEEE/CVF Computer Vision and Pattern Recognition Conference 2023 (CVPR 2023) |
| --- | --- |
| Location | Vancouver Convention Center |
| City | Vancouver |
| Country | Canada |
| Period | 18 - 22 June 2023 |
Link(s)
| Permanent Link | https://scholars.cityu.edu.hk/en/publications/publication(81544d89-ed9e-4b58-a978-dfdeb9a26f46).html |
| --- | --- |
Abstract
In this paper, we study masked autoencoding (MAE) pre-training on videos for matching-based downstream tasks, including visual object tracking (VOT) and segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the frame pixels. However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate the temporal correspondence learning in videos. We show that our DropMAE is a strong and efficient temporal matching learner, which achieves better fine-tuning results on matching-based tasks than the ImageNet-based MAE with faster pre-training speed. Moreover, we also find that motion diversity in pre-training videos is more important than scene diversity for improving the performance on VOT and VOS. Our pre-trained DropMAE model can be directly loaded in existing ViT-based trackers for fine-tuning without further modifications. Notably, DropMAE sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets. Our code and pre-trained models will be released.
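The abstract describes dropping spatial (within-frame) attention during frame reconstruction so that the model must rely on temporal correspondence across frames. The paper's actual mechanism is adaptive and lives inside the ViT attention layers; the NumPy sketch below is only a minimal illustration of the core idea, where a fixed dropout rate `p_drop`, the function name, and the random per-entry masking are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def spatial_attention_dropout(attn_logits, frame_ids, p_drop, rng):
    """Drop within-frame (spatial) attention entries before softmax.

    attn_logits: (N, N) pre-softmax attention scores for N video tokens.
    frame_ids:   (N,) frame index of each token.
    p_drop:      probability of dropping a same-frame entry (assumed fixed
                 here; DropMAE's dropout is adaptive).

    Cross-frame (temporal) entries are never dropped, which nudges the
    reconstruction toward temporal cues rather than spatial ones.
    """
    same_frame = frame_ids[:, None] == frame_ids[None, :]        # spatial pairs
    drop = (rng.random(attn_logits.shape) < p_drop) & same_frame
    masked = np.where(drop, -np.inf, attn_logits)                # kill dropped entries
    masked = masked - masked.max(axis=-1, keepdims=True)         # stable softmax
    w = np.exp(masked)                                           # exp(-inf) -> 0
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
frame_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 4 tokens per frame, 2 frames
logits = rng.standard_normal((8, 8))
attn = spatial_attention_dropout(logits, frame_ids, p_drop=0.5, rng=rng)
```

Because cross-frame entries are exempt from dropout, every row of `attn` still sums to one and every temporal (cross-frame) weight stays positive, while roughly half of the spatial weights are zeroed out.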
Bibliographic Note
As this conference has yet to take place, the information in this record is subject to revision.
Citation Format(s)
DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks. / WU, Qiangqiang; YANG, Tianyu; LIU, Ziquan et al.
2023. Paper presented at IEEE/CVF Computer Vision and Pattern Recognition Conference 2023 (CVPR 2023), Vancouver, Canada.