Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation
Research output: Chapters, Conference Papers, Creative and Literary Works (RGC: 12, 32, 41, 45) › 32_Refereed conference paper (with host publication) › peer-review
Detail(s)
Original language | English |
---|---|
Title of host publication | Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI'2022) |
Publisher | AAAI Press |
Pages | 3380-3389 |
Volume | 36 |
ISBN (Print) | 978-1-57735-876-3 |
Publication status | Published - 30 Jun 2022 |
Conference
Title | 36th AAAI Conference on Artificial Intelligence (AAAI-22) |
---|---|
Location | Virtual |
Period | 22 February - 1 March 2022 |
Link(s)
DOI | DOI |
---|---|
Permanent Link | https://scholars.cityu.edu.hk/en/publications/publication(e5e426d5-932d-403e-84f3-9eca0aaac1ba).html |
Abstract
Spatio-temporal representation learning is critical for video self-supervised representation. Recent approaches mainly use contrastive learning and pretext tasks. However, these approaches learn representation by discriminating sampled instances via feature similarity in the latent space while ignoring the intermediate state of the learned representations, which limits the overall performance. In this work, taking into account the degree of similarity of sampled instances as the intermediate state, we propose a novel pretext task: spatio-temporal overlap rate (STOR) prediction. It stems from the observation that humans are capable of discriminating the overlap rates of videos in space and time. This task encourages the model to discriminate the STOR of two generated samples to learn the representations. Moreover, we employ a joint optimization combining pretext tasks with contrastive learning to further enhance the spatio-temporal representation learning. We also study the mutual influence of each component in the proposed scheme. Extensive experiments demonstrate that our proposed STOR task can favor both contrastive learning and pretext tasks. The joint optimization scheme can significantly improve the spatio-temporal representation in video understanding. The code is available at https://github.com/Katou2/CSTP.
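As a rough illustration of what an overlap-rate target between two sampled clips could look like, the sketch below computes the intersection-over-union of two axis-aligned spatio-temporal crops. The abstract does not give the exact definition of STOR, so the IoU formulation, the `Clip` layout, and the function names here are assumptions made for illustration only, not the authors' implementation; the linked repository is the authoritative source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    """Hypothetical axis-aligned spatio-temporal crop with half-open ranges."""
    t0: int  # first frame index (inclusive)
    t1: int  # end frame index (exclusive)
    x0: int  # left pixel column (inclusive)
    x1: int  # right pixel column (exclusive)
    y0: int  # top pixel row (inclusive)
    y1: int  # bottom pixel row (exclusive)

    def volume(self) -> int:
        # Number of (frame, x, y) cells covered by the crop.
        return (self.t1 - self.t0) * (self.x1 - self.x0) * (self.y1 - self.y0)


def _overlap_1d(a0: int, a1: int, b0: int, b1: int) -> int:
    # Length of the intersection of [a0, a1) and [b0, b1); 0 if disjoint.
    return max(0, min(a1, b1) - max(a0, b0))


def overlap_rate(a: Clip, b: Clip) -> float:
    """Assumed overlap measure: 3D intersection-over-union of two crops."""
    inter = (
        _overlap_1d(a.t0, a.t1, b.t0, b.t1)
        * _overlap_1d(a.x0, a.x1, b.x0, b.x1)
        * _overlap_1d(a.y0, a.y1, b.y0, b.y1)
    )
    union = a.volume() + b.volume() - inter
    return inter / union if union > 0 else 0.0


# Two 16-frame, 112x112 crops, shifted by 8 frames in time and 56 pixels in x.
clip_a = Clip(t0=0, t1=16, x0=0, x1=112, y0=0, y1=112)
clip_b = Clip(t0=8, t1=24, x0=56, x1=168, y0=0, y1=112)
print(overlap_rate(clip_a, clip_b))
```

Under this assumed definition, identical crops score 1.0 and disjoint crops score 0.0, so the value could serve as a graded similarity label between the two extremes that instance discrimination alone distinguishes.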
Citation Format(s)
Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation. / Zhang, Yujia; Po, Lai-Man; Xu, Xuyuan et al.
Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI'2022). Vol. 36. AAAI Press, 2022. p. 3380-3389.