Temporal Grounding for Video Moment Retrieval
基於時序定位的視頻片段檢索
Student thesis: Doctoral Thesis
Detail(s)
Award date | 4 Sept 2023 |
---|---|
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(d2ae04b7-1cf0-4595-8c65-054dc595ccb6).html |
---|---|
Abstract
The thesis investigates the problem of video moment retrieval from the perspective of temporal grounding, which aims to localize the temporal segment relevant to a user query, typically a natural language query, inside an untrimmed video or a video corpus. We systematically explore this important and challenging problem from three aspects: (1) how to effectively handle multi-modal video content; (2) how to efficiently handle long-form video input; and (3) how to flexibly handle free-form query formats.
We first tackle temporal grounding in the setting of multi-modal video content, which includes both visual frames and textual automatic speech transcripts. Existing approaches have yet to fully and jointly explore the early fusion of query context and video content. Motivated by this, we propose CONQUER, a novel architecture that exploits query context for multi-modal fusion and representation learning in two steps. Technically, the first step derives fusion weights for the adaptive combination of multi-modal video content. The second step performs bi-directional attention to tightly couple the video and the query into a single joint representation for moment localization. Because the query context is fully engaged in video representation learning, from feature fusion to transformation, the resulting feature is user-centered and has a greater capacity to capture multi-modal signals specific to the query.
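The two-step design can be read as a small PyTorch sketch. The code below is an illustrative interpretation of the description above, not the released CONQUER implementation: the module names, feature dimensions, and the pooled query embedding are assumptions. Step 1 turns the query into softmax fusion weights over the visual and ASR streams; step 2 couples the fused video with the query tokens via cross-attention in both directions.

```python
# Illustrative sketch only; sizes and module names are assumptions.
import torch
import torch.nn as nn


class QueryGuidedFusion(nn.Module):
    """Step 1: derive query-conditioned weights to combine visual and ASR streams."""

    def __init__(self, dim: int = 256, num_modalities: int = 2):
        super().__init__()
        self.weight_head = nn.Linear(dim, num_modalities)

    def forward(self, query_emb, visual_feat, asr_feat):
        # query_emb: (B, D) pooled query; visual/asr_feat: (B, T, D) clip features
        w = torch.softmax(self.weight_head(query_emb), dim=-1)   # (B, 2) fusion weights
        streams = torch.stack([visual_feat, asr_feat], dim=1)    # (B, 2, T, D)
        fused = (w[:, :, None, None] * streams).sum(dim=1)       # (B, T, D) query-adaptive mix
        return fused


class BiDirectionalCoupling(nn.Module):
    """Step 2: cross-attention in both directions to form a joint representation."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.vid2query = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query2vid = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, fused_video, query_tokens):
        # fused_video: (B, T, D); query_tokens: (B, L, D)
        v_attn, _ = self.vid2query(fused_video, query_tokens, query_tokens)
        q_attn, _ = self.query2vid(query_tokens, fused_video, fused_video)
        # A moment-localization head (e.g. start/end prediction) would consume v_attn.
        return v_attn, q_attn
```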
We then tackle temporal grounding in the setting of long-form video input, which poses two challenges: high inference cost and weak multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. Technically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism that incorporates contrastive learning to enhance multi-modal alignment for long videos. With this coarse-to-fine design, CONE achieves both higher efficiency and more accurate multi-modal alignment.
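The coarse-to-fine pipeline can be sketched as follows. This is a hedged illustration under assumed interfaces, not the released CONE code: `fine_grained_grounder` is a hypothetical stand-in for any window-level moment localizer, and the pooled window embeddings are assumed to come from the same contrastively trained encoder as the query.

```python
# Illustrative sketch only; interfaces are assumptions.
import torch
import torch.nn.functional as F


def select_windows(query_emb, window_embs, top_k: int = 5):
    """Coarse stage: rank sliding windows by query similarity and keep the top-k."""
    # query_emb: (D,); window_embs: (N, D) pooled features of N candidate windows
    sims = F.cosine_similarity(query_emb[None, :], window_embs, dim=-1)  # (N,)
    return torch.topk(sims, k=min(top_k, window_embs.size(0))).indices


def coarse_to_fine_retrieve(query_emb, window_embs, windows, fine_grained_grounder):
    """Fine stage: run the expensive localizer only inside the selected windows."""
    keep = select_windows(query_emb, window_embs)
    candidates = []
    for idx in keep.tolist():
        start, end, score = fine_grained_grounder(windows[idx], query_emb)
        candidates.append((start, end, score))
    # Return the highest-scoring moment across the surviving windows.
    return max(candidates, key=lambda c: c[2])
```

Because only a handful of windows reach the fine stage, inference cost grows with the number of selected windows rather than with the full video length.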
Finally, we tackle temporal grounding in the setting of free-form query input formats. Existing approaches design a specific model architecture for each query type and train them separately, which prevents knowledge sharing between tasks and limits generalization. Motivated by this, we propose GroundFormer, a unified model that unifies the query formats and output distributions. Technically, GroundFormer represents every query as a textual token sequence, treats a visual image query as a special visual token, and further leverages multi-scale pyramid feature learning to adapt to various output distributions. Moreover, we propose a two-stage transfer learning scheme, consisting of model pre-training and multi-task training, to gradually enhance generalization. With this unified model and training strategy, the resulting model acquires a more holistic and versatile capability for video grounding than its single-task counterparts.
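A minimal sketch of the unified query interface and the multi-scale pyramid is given below. The token layout, embedding sizes, and convolutional downsampling are assumptions made for illustration and do not reproduce the thesis implementation of GroundFormer.

```python
# Illustrative sketch only; layout and sizes are assumptions.
import torch
import torch.nn as nn


class UnifiedQueryEncoder(nn.Module):
    """Represent text queries as token sequences and an image query as one
    special visual token, so both share the same downstream grounding head."""

    def __init__(self, vocab_size: int = 30522, dim: int = 256, img_dim: int = 512):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(img_dim, dim)   # maps an image feature to one token
        self.type_embed = nn.Embedding(2, dim)    # 0 = text token, 1 = visual token

    def forward(self, text_ids=None, image_feat=None):
        tokens = []
        if text_ids is not None:                  # (B, L) text token ids
            tokens.append(self.text_embed(text_ids) + self.type_embed.weight[0])
        if image_feat is not None:                # (B, img_dim) image query feature
            tokens.append(self.img_proj(image_feat)[:, None, :] + self.type_embed.weight[1])
        assert tokens, "provide a text query, an image query, or both"
        return torch.cat(tokens, dim=1)           # unified (B, L[+1], D) query sequence


class TemporalPyramid(nn.Module):
    """Multi-scale video features so both short and long moments are covered."""

    def __init__(self, dim: int = 256, levels: int = 3):
        super().__init__()
        self.downsample = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1) for _ in range(levels - 1)]
        )

    def forward(self, video_feat):                # (B, T, D)
        x = video_feat.transpose(1, 2)            # (B, D, T) for Conv1d
        feats = [video_feat]
        for conv in self.downsample:
            x = conv(x)
            feats.append(x.transpose(1, 2))       # progressively coarser temporal scales
        return feats
```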
We evaluate the proposed techniques on several large-scale real-world video datasets, including DiDeMo, TVR, MAD, and Ego4D. Experimental evaluations demonstrate promising results, which shed light on various real-world multi-modal video applications, such as memory search in augmented reality, object search in robotics, and search inside web videos. In summary, the main contribution of this thesis is to expand the flexibility of both video content and query format towards more practical video retrieval applications, along with the novel approaches proposed to achieve this vision.