DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

Qiangqiang Wu, Tianyu Yang*, Ziquan Liu, Baoyuan Wu, Ying Shan, Antoni B. Chan

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

109 Citations (Scopus)

Abstract

In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the frame pixels. However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos. We show that our DropMAE is a strong and efficient temporal matching learner, which achieves better finetuning results on matching-based tasks than the ImageNet-based MAE with 2× faster pre-training speed. Moreover, we also find that motion diversity in pre-training videos is more important than scene diversity for improving the performance on VOT and VOS. Our pre-trained DropMAE model can be directly loaded in existing ViT-based trackers for fine-tuning without further modifications. Notably, DropMAE sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets. Our code and pre-trained models are available at https://github.com/jimmy-dq/DropMAE.git. © 2023 IEEE.
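As a rough illustration of the idea in the abstract, the sketch below masks some within-frame (spatial) attention scores before the softmax, so that a query token has to place more of its attention on tokens from the other frame. It is not the authors' implementation: the abstract describes the dropout as adaptive, whereas this sketch drops entries uniformly at random. The names `spatial_attention_dropout`, `frame_ids`, and `drop_prob` are hypothetical and chosen only for this example.

```python
import math
import random


def spatial_attention_dropout(logits, frame_ids, drop_prob, rng):
    """Mask a random subset of within-frame attention logits.

    Illustrative only: uniform random dropping, not the adaptive
    scheme described in the paper.

    logits:    N x N list of lists; logits[q][k] scores key k for query q
    frame_ids: length-N list; frame_ids[i] is the frame of token i
    drop_prob: chance of dropping each within-frame, non-self entry
    rng:       a random.Random instance, passed in for reproducibility
    """
    n = len(logits)
    out = [row[:] for row in logits]  # copy so the input is left unchanged
    for q in range(n):
        for k in range(n):
            same_frame = frame_ids[q] == frame_ids[k]
            # The self entry (q == k) is never dropped, so every row keeps
            # at least one finite logit and the softmax stays well defined.
            if same_frame and q != k and rng.random() < drop_prob:
                out[q][k] = -math.inf
    return out


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Toy input: four tokens, two per frame, all logits equal.
frame_ids = [0, 0, 1, 1]
logits = [[0.0] * 4 for _ in range(4)]

# Without dropout, each query spreads attention evenly over all tokens.
baseline = [softmax(row) for row in logits]

# With drop_prob=1.0, every within-frame, non-self entry is masked.
rng = random.Random(0)
masked = spatial_attention_dropout(logits, frame_ids, drop_prob=1.0, rng=rng)
attn = [softmax(row) for row in masked]

cross_before = baseline[0][2] + baseline[0][3]  # attention on the other frame
cross_after = attn[0][2] + attn[0][3]
```

In this toy setup, the share of attention that token 0 gives to the other frame rises from one half to two thirds once its within-frame neighbour is masked, which is the kind of pressure toward temporal cues that the abstract motivates.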
Original language: English
Title of host publication: Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Publisher: IEEE
Pages: 14561-14571
ISBN (Electronic): 979-8-3503-0129-8
ISBN (Print): 979-8-3503-0130-4
Publication status: Published - 2023
Event: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023) - Vancouver Convention Center, Vancouver, Canada
Duration: 18 Jun 2023 to 22 Jun 2023
https://cvpr2023.thecvf.com/Conferences/2023
https://openaccess.thecvf.com/menu
https://ieeexplore.ieee.org/xpl/conhome/1000147/all-proceedings

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075

Conference

Conference: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2023)
Abbreviated title: CVPR2023
Country/Territory: Canada
City: Vancouver
Period: 18/06/23 to 22/06/23

Bibliographical note

Information for this record is supplemented by the author(s) concerned.

Research Keywords

  • Self-supervised or unsupervised representation learning
