Interpretable and Generative Ad-hoc Video Search
可解釋以及生成式的即席視頻檢索
Student thesis: Doctoral Thesis
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors | |
Award date | 25 Mar 2024 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(52460b07-fc92-45f7-8877-c59455945da6).html |
---|---|
Abstract
This thesis investigates the problem of ad-hoc video search (AVS) from three perspectives: 1) developing interpretable embeddings via dual-task learning, 2) learning consistent and coherent interpretations of cross-modal representations through (un)likelihood training, and 3) enhancing AVS query understanding through valid generations.
The thesis first focuses on developing interpretable embeddings for the AVS task. Recently, the concept-free approach, which matches queries and videos in a joint latent space, has outperformed the concept-based approach in video retrieval. Despite its effectiveness, this newer approach yields embedded features and search results that lack interpretability, which hinders subsequent video browsing and query reformulation. To address this, we propose a neural network that combines feature embedding with concept interpretation for simultaneous dual-task learning. Specifically, we propose a novel likelihood learning objective to address label sparsity in the interpretation. As a result, this approach associates each video feature with a list of semantic concepts, providing a clearer interpretation of the video content. Furthermore, we provide insight into how the semantic concepts complement feature embeddings in video retrieval.
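The dual-task idea above can be illustrated with a minimal sketch. It is not the thesis's actual objective: the abstract does not give the form of the likelihood loss, so plain binary cross-entropy stands in for it, and the triplet ranking loss, the `margin` and `weight` parameters, and all function names are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triplet_ranking_loss(query, pos_video, neg_video, margin=0.2):
    """Embedding task (assumed form): rank the matching video above a non-matching one."""
    return max(0.0, margin - cosine(query, pos_video) + cosine(query, neg_video))


def concept_bce(probs, labels, eps=1e-7):
    """Interpretation task (stand-in): binary cross-entropy over the concept vocabulary."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def dual_task_loss(query, pos_video, neg_video, concept_probs, concept_labels, weight=1.0):
    """Sum of the embedding loss and the weighted concept-interpretation loss."""
    ranking = triplet_ranking_loss(query, pos_video, neg_video)
    interpretation = concept_bce(concept_probs, concept_labels)
    return ranking + weight * interpretation
```

Training both terms together is what ties each embedding to a decodable list of concepts; how the thesis weights the two tasks and handles sparse labels is not stated in the abstract.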
To complement the likelihood training proposed above for interpreting embeddings, we propose a novel unlikelihood training objective that prevents mutually exclusive concept pairs from being decoded simultaneously. Likelihood training focuses on interpreting the meanings of embeddings beyond the scope of the training labels. Conversely, unlikelihood training uses prior knowledge as a regularizer to reduce inconsistency in the interpretations. Incorporating these dual objectives, we propose a new encoder-decoder network that learns interpretable cross-modal representations for ad-hoc video search. We provide insights into how (un)likelihood training can introduce interesting and rare concepts into the interpretation, leading to higher and more robust search performance.
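One way to picture the unlikelihood term is as a penalty that grows when two mutually exclusive concepts are both decoded with high probability. The sketch below is an assumption, not the formulation used in the thesis: the product-of-probabilities form, the function name, and the example concept names are all hypothetical.

```python
import math


def unlikelihood_penalty(concept_probs, exclusive_pairs, eps=1e-7):
    """Penalize decoding both members of a mutually exclusive concept pair.

    concept_probs: dict mapping concept name -> predicted probability.
    exclusive_pairs: list of (concept_a, concept_b) taken from prior knowledge.
    Assumed form: -log(1 - p_a * p_b), averaged over all pairs.
    """
    if not exclusive_pairs:
        return 0.0
    total = 0.0
    for a, b in exclusive_pairs:
        joint = concept_probs[a] * concept_probs[b]
        total += -math.log(max(1.0 - joint, eps))  # clamp to avoid log(0)
    return total / len(exclusive_pairs)


# Hypothetical example: "indoor" and "outdoor" should not both be decoded.
pairs = [("indoor", "outdoor")]
inconsistent = unlikelihood_penalty({"indoor": 0.9, "outdoor": 0.8}, pairs)
consistent = unlikelihood_penalty({"indoor": 0.9, "outdoor": 0.05}, pairs)
```

In this sketch the inconsistent interpretation receives the larger penalty, which is the behavior the paragraph above describes for prior knowledge used as a regularizer.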
In the previous two studies, the interpretable embeddings struggled with out-of-vocabulary terms and with complex queries that involve logical operators or spatial concepts. To better address these problems, we propose a novel generative video search framework that understands a user query through valid generations. Specifically, capitalizing on cross-modal generation techniques, we propose three types of generation to understand the user query: text-to-text, text-to-image, and image-to-text. Furthermore, because the generated queries may be irrelevant to the original user query, we propose a QA verification mechanism that maintains their faithfulness by posing textual and visual questions to multi-modal large language models.
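The generate-then-verify flow described above can be sketched as a small pipeline. Everything here is a placeholder: `expand_query`, the `threshold` parameter, and the toy generator and verifier are hypothetical stand-ins, and the word-overlap check only imitates the role of the QA verification that the thesis performs with multi-modal large language models.

```python
def expand_query(query, generators, verifier, threshold=0.5):
    """Return the original query plus generated queries that pass verification.

    generators: dict mapping a generation type to a function query -> list of candidates.
    verifier: function (query, candidate) -> faithfulness score in [0, 1].
    """
    kept = [("original", query)]
    for name, generate in generators.items():
        for candidate in generate(query):
            if verifier(query, candidate) >= threshold:
                kept.append((name, candidate))
    return kept


def toy_text_to_text(query):
    """Placeholder generator: one faithful rewrite and one unrelated output."""
    return [query + " outdoors", "a cat sleeping"]


def toy_verifier(query, candidate):
    """Placeholder verifier: fraction of query words that reappear in the candidate."""
    q_words = set(query.lower().split())
    c_words = set(candidate.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


result = expand_query(
    "a man riding a horse",
    {"text-to-text": toy_text_to_text},
    toy_verifier,
)
```

The surviving queries would then be searched alongside the original one. Text-to-image and image-to-text generators would plug into the same `generators` dictionary under this design.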
Empirical results on the TRECVid AVS benchmarks and text-to-video retrieval datasets show that our proposed interpretable embedding outperforms state-of-the-art concept-free and concept-based text-to-video search methods by a statistically significant margin. The proposed (un)likelihood training produces consistent and coherent interpretations of query and video embeddings, which further boost retrieval performance. The generated visual and textual queries effectively address the out-of-vocabulary problem and improve the understanding of complex queries. They also complement the original query in video retrieval.
In summary, this thesis addresses the problem of ad-hoc video search with three major contributions: integrating concept-based and embedding-based search in a joint framework, developing consistent and coherent interpretations for cross-modal embedding features, and developing a novel generative framework to enhance query understanding.
- ad-hoc video retrieval, interpretable embedding, generative search, (un)likelihood training