Matching-based Representation Learning for Object Tracking and Segmentation with Limited Annotations
Student thesis: Doctoral Thesis
Detail(s)

Award date | 3 May 2024 |
---|---|

Link(s)

Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(3143744b-3783-4b8b-8786-0d6c90c46cd6).html |
---|---|
Abstract
The learning-to-match paradigm aims to learn a model that can effectively identify and associate target objects across video frames or static images. It plays a key role in matching-based computer vision tasks such as video object tracking (VOT) and video object segmentation (VOS), and has been widely applied in surveillance applications. In this thesis, we propose methods for learning representations suited to matching-based VOT and VOS tasks, including: a novel unsupervised temporal representation learning approach for VOT, termed progressive unsupervised learning (PUL); a generative masked-autoencoding pre-training scheme with spatial-attention dropout for tracking tasks, termed DropMAE; a simple yet effective video object segmentation baseline with strong temporal matching ability (SimVOS); and a new crowd counting framework built upon an external momentum template that guides counting via template matching (C2MoT). The main research results are as follows:
1) Existing end-to-end trainable deep trackers learn rich feature representations from large-scale annotated training videos via supervised learning. However, annotating such large-scale video datasets is prohibitively expensive and time-consuming. In this thesis, we propose a progressive unsupervised learning (PUL) framework that entirely removes the need for annotated training videos in visual tracking. PUL consists of a background discrimination (BD) model that distinguishes an object from the background via contrastive learning, a progressive temporal mining module that mines temporally corresponding patches, and a noise-robust loss function that learns temporal correspondences more effectively from the noisy mined data.
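To make these components concrete, the minimal PyTorch sketch below illustrates a contrastive background-discrimination loss and a generalized-cross-entropy-style noise-robust weighting for mined pairs. The function names, the temperature, and the exponent `q` are illustrative assumptions, not the implementation used in the thesis.

```python
# Minimal sketch (illustrative, not the thesis implementation) of a
# contrastive background-discrimination loss and a noise-robust weighting
# for noisy mined temporal correspondences.
import torch
import torch.nn.functional as F

def background_discrimination_loss(obj_emb, pos_emb, bg_embs, tau=0.07):
    """Pull an object embedding toward its mined temporal counterpart and
    push it away from background patch embeddings (InfoNCE-style).
    obj_emb, pos_emb: (D,); bg_embs: (N, D)."""
    obj_emb = F.normalize(obj_emb, dim=-1)
    pos_emb = F.normalize(pos_emb, dim=-1)
    bg_embs = F.normalize(bg_embs, dim=-1)
    pos_sim = (obj_emb @ pos_emb) / tau              # scalar similarity
    neg_sim = (bg_embs @ obj_emb) / tau              # (N,) similarities
    logits = torch.cat([pos_sim[None], neg_sim])     # positive at index 0
    return F.cross_entropy(logits[None], torch.zeros(1, dtype=torch.long))

def noise_robust_loss(ce_losses, q=0.5):
    """Generalized cross-entropy style robust loss: with p = exp(-ce),
    L_q = (1 - p**q) / q down-weights hard (likely noisy) mined pairs."""
    p = torch.exp(-ce_losses).clamp(min=1e-6)
    return ((1.0 - p ** q) / q).mean()
```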
2) To investigate masked autoencoder video pre-training for temporal matching-based downstream tasks, we propose DropMAE, which adaptively performs spatial-attention dropout during frame reconstruction to facilitate temporal correspondence learning in videos. We show that DropMAE is a strong and efficient temporal matching learner: it achieves better fine-tuning results on matching-based tasks than the ImageNet-based MAE while pre-training $2\times$ faster. The pre-trained DropMAE model can be loaded directly into existing ViT-based VOT and VOS approaches for fine-tuning without further modification.
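The sketch below shows one plausible form of spatial-attention dropout: within-frame attention logits of a two-frame token pair are randomly suppressed so that reconstruction must rely more on cross-frame (temporal) cues. The function name, the uniform drop probability, and the masking scheme are assumptions for illustration; DropMAE's adaptive dropout differs in how it selects which entries to drop.

```python
# Minimal sketch (assumption, not the released DropMAE code): suppress a
# fraction of within-frame attention so frame reconstruction leans on
# cross-frame (temporal) matching.
import torch

def spatial_attention_dropout(attn_logits, frame_ids_q, frame_ids_k, p=0.1):
    """attn_logits: (B, H, Nq, Nk) pre-softmax scores over tokens of a
    two-frame pair; frame_ids_q: (Nq,), frame_ids_k: (Nk,) give each
    token's frame index. Drops within-frame entries with probability p."""
    same_frame = frame_ids_q[:, None] == frame_ids_k[None, :]   # (Nq, Nk)
    drop = (torch.rand_like(attn_logits) < p) & same_frame      # broadcast to (B, H, Nq, Nk)
    return attn_logits.masked_fill(drop, float("-inf"))

# Illustrative use inside an attention block:
#   logits = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
#   logits = spatial_attention_dropout(logits, fid, fid, p=0.1)
#   attn = logits.softmax(dim=-1)
```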
3) Current popular methods for video object segmentation (VOS) implement feature matching through several hand-crafted modules that perform feature extraction and matching separately, which may lead to insufficient target interaction and limited target-aware feature learning. To address these issues, we present a scalable Simplified VOS (SimVOS) framework that performs joint feature extraction and matching with a single transformer backbone. SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features, enabling it to learn better target-aware features for accurate mask prediction. Importantly, SimVOS can directly adopt well pre-trained ViT backbones (e.g., MAE) for VOS, bridging the gap between VOS and large-scale self-supervised pre-training. To achieve a better speed-accuracy trade-off, we further explore within-frame attention and propose a new token refinement module that improves running speed and reduces computational cost.
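The following sketch conveys the general idea of joint feature extraction and matching in a single ViT backbone: query- and reference-frame tokens (with an embedded reference mask) are concatenated and processed by one transformer encoder, and a light head predicts the query mask. Module and parameter names are hypothetical, and positional embeddings and the token refinement module are omitted; this is not the released SimVOS code.

```python
# Minimal sketch (illustrative, not the SimVOS release): a single ViT-style
# encoder jointly extracts features and matches query/reference tokens.
import torch
import torch.nn as nn

class JointViTVOS(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, patch, patch)    # image -> patch tokens
        self.mask_embed = nn.Conv2d(1, dim, patch, patch)     # reference mask -> tokens
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch * patch)             # per-token mask logits

    def forward(self, query_img, ref_img, ref_mask):
        q = self.patch_embed(query_img).flatten(2).transpose(1, 2)    # (B, Nq, D)
        r = self.patch_embed(ref_img).flatten(2).transpose(1, 2)      # (B, Nr, D)
        r = r + self.mask_embed(ref_mask).flatten(2).transpose(1, 2)  # inject target cue
        tokens = self.encoder(torch.cat([q, r], dim=1))   # joint extraction + matching
        return self.head(tokens[:, : q.shape[1]])         # mask logits for query patches
```

Because extraction and matching share the same attention layers, every block lets query tokens attend to the (mask-augmented) reference tokens, which is the target interaction that separate hand-crafted matching modules may lack.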
4) Inspired by the success of matching learning in temporal perception understanding tasks (e.g., VOT and VOS), we further extend this idea to crowd counting. Specifically, to improve the generalization of existing crowd counting methods to unseen domains and achieve better zero-shot cross-domain crowd counting, we propose a novel crowd counting framework built upon an external Momentum Template, termed C2MoT, which encodes domain-specific information in an external template representation and performs crowd counting via template matching. The Momentum Template (MoT) is learned with momentum updating during offline training and is then dynamically updated for each test image during online cross-dataset evaluation. Thanks to the dynamically updated MoT, C2MoT generates dense target correspondences that explicitly account for head regions and then predicts the density map from the normalized correspondence map.
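The sketch below gives one possible reading of the momentum-template idea: an external template vector is updated by an exponential moving average during training, and a normalized dense correspondence (cosine-similarity) map between image features and the template highlights head-like regions for the density predictor. The tensor shapes, momentum value, and normalization are illustrative assumptions rather than the C2MoT implementation.

```python
# Minimal sketch (illustrative names, not the C2MoT release): momentum
# update of an external template and a normalized dense correspondence map
# used to guide density prediction.
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(template, head_feats, m=0.999):
    """EMA-style update of the external template (D,) from the mean of
    head-region features (N, D) extracted from the current batch/image."""
    return m * template + (1.0 - m) * head_feats.mean(dim=0)

def correspondence_map(feat_map, template):
    """feat_map: (B, D, H, W); template: (D,). Returns a normalized dense
    correspondence map (B, 1, H, W) highlighting template-like regions."""
    f = F.normalize(feat_map, dim=1)
    t = F.normalize(template, dim=0).view(1, -1, 1, 1)
    corr = (f * t).sum(dim=1, keepdim=True)                 # cosine similarity per location
    corr = corr - corr.amin(dim=(2, 3), keepdim=True)       # min-max normalization
    return corr / (corr.amax(dim=(2, 3), keepdim=True) + 1e-6)
```

In such a setup, the normalized correspondence map would be concatenated with the backbone features and fed to a density head; at test time the same momentum update can be applied per image to adapt the template to the unseen domain.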
Experiments on popular VOT, VOS and crowd counting benchmarks demonstrate that our approaches achieve strong performance, showing great potential for perception understanding tasks.
- Video Object Tracking, Video Object Segmentation