Interactive Video Retrieval

Student thesis: Doctoral Thesis

Award date: 20 Aug 2020

Abstract

This thesis investigates the problem of interactive video retrieval from four aspects: (1) developing different versions of an interactive video search system and participating in video search benchmarks from 2017 to 2019; (2) proposing a simulation framework to study search strategies and model parameters for color-sketch retrieval, then proposing an efficient color-sketch retrieval model; (3) exploring different techniques for text-based video retrieval; and (4) conducting a user study to compare the performance of text-based interactive search against automatic search.

We first focus on the development of our video search system. A video search system is a complex system consisting of different search modules together with a user interface. A user interacts with the system by formulating queries and browsing the results in a loop until locating the desired video segments. Accordingly, the search system should index multi-modal video content, provide different querying methods, and perform searches in real time. The interface is expected to be intuitive, such that the results are self-explanatory and easy to interact with. In our video search system, we consider three essential search modules: query by sketch, query by text, and query by example. Different types of information are extracted from the videos to serve these modules, including color distribution, visual concepts, metadata, on-screen text, speech, and visual features. We also study conventional interface designs for this interactive search system. Over three years, from 2017 to 2019, we proposed different approaches and adjustments to improve the search efficiency and user-friendliness of the system. Each version participated in the Video Browser Showdown benchmark and attained competitive results compared to state-of-the-art video search systems.
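To make the module layout concrete, the following is a minimal, hypothetical sketch of how such a system could organize its per-shot multi-modal index and expose query-by-text and query-by-example entry points. The class names, index fields, and naive scoring rules are illustrative assumptions, not the actual implementation described in the thesis.

```python
import math
from dataclasses import dataclass


@dataclass
class ShotIndex:
    """Per-shot multi-modal index entries (field choices are assumptions)."""
    concepts: dict        # concept name -> detector confidence
    visual_feature: list  # deep feature vector, for query by example
    text: str             # metadata, on-screen text, speech transcript


class VideoSearchSystem:
    def __init__(self):
        self.shots = {}   # shot id -> ShotIndex

    def query_by_text(self, query, k=100):
        """Rank shots by naive word overlap with indexed text and concepts."""
        words = set(query.lower().split())
        scores = {
            sid: len(words & set(ix.text.lower().split()))
            + sum(conf for name, conf in ix.concepts.items() if name in words)
            for sid, ix in self.shots.items()
        }
        return sorted(scores, key=scores.get, reverse=True)[:k]

    def query_by_example(self, feature, k=100):
        """Rank shots by cosine similarity to an example shot's feature."""
        def cos(a, b):
            num = sum(x * y for x, y in zip(a, b))
            den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return num / den if den else 0.0
        scores = {sid: cos(feature, ix.visual_feature)
                  for sid, ix in self.shots.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]
```

In an interactive session, the user would alternate between these entry points and result browsing until the target segment is found, which is the loop the interface design has to support.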

To investigate color-sketch retrieval, we first employed the color position feature signatures proposed by the SIRET research group. This model has proven to be a practical approach to video retrieval, particularly when the user has a visual memory of the target video segment. However, the model has many parameters to fine-tune, and the best parameters may differ between users. Studying this model thoroughly would require a vast number of user studies, which is practically impossible given the diversity of users. Hence, we proposed a simulation framework to study better search strategies and better parameters for a color-sketch retrieval model on a given video dataset. The simulation study provided insights into IACC.3, a large-scale video dataset. Motivated by these findings, we developed an efficient model for color-sketch retrieval. The proposed model has been exhibited twice in the Video Browser Showdown and proven beneficial. A further simulation study validated that the functionalities of the proposed model can boost search efficiency.
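As a rough illustration of the retrieval setting, the sketch below loosely follows the feature-signature idea: each keyframe is summarized by weighted (position, color) centroids, a query is a few user-placed colored points, and keyframes are ranked by how closely their centroids match those points. The distance function and the spatial/color weights stand in for exactly the kind of tunable parameters the simulation framework studies; the specific formulas and values here are assumptions, not the thesis model.

```python
import math


def point_distance(p, c, w_pos=1.0, w_col=1.0):
    """Distance between a sketch point and a signature centroid.
    p, c: (x, y, r, g, b) tuples with all components normalized to [0, 1].
    w_pos and w_col are illustrative, tunable spatial/color weights."""
    d_pos = math.dist(p[:2], c[:2])   # spatial distance
    d_col = math.dist(p[2:], c[2:])   # color distance
    return w_pos * d_pos + w_col * d_col


def sketch_score(sketch_points, signature):
    """signature: list of (weight, (x, y, r, g, b)) centroids.
    Each sketch point matches its closest centroid; heavier centroids
    attract matches more strongly. Lower scores rank higher."""
    return sum(
        min(point_distance(p, c) / max(w, 1e-6) for w, c in signature)
        for p in sketch_points
    )


def rank_keyframes(sketch_points, signatures, k=100):
    """Rank keyframes (id -> signature) against the user's color sketch."""
    scores = {kid: sketch_score(sketch_points, sig)
              for kid, sig in signatures.items()}
    return sorted(scores, key=scores.get)[:k]
```

A simulated user in such a framework would repeatedly sample sketch points from a known target keyframe, run the ranking, and record the target's rank, which allows search strategies and parameter settings to be compared without human subjects.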

Besides sketch-based retrieval, text-based retrieval is a fundamental search approach for any retrieval system. Textual data for searching may come from manual annotation, e.g., the video name, description, or tags provided by the uploader. However, manual annotation is not always consistent, and is sometimes disorganized or even unreliable. On the other hand, automatic annotation is imperfect and limited in narrowing the semantic gap among low-level visual features, user queries, and search context. In this study, we provide two approaches for text-based retrieval with automatic annotation: a concept-based approach and a multi-modal embedding approach. For the concept-based approach, we leveraged our large concept bank together with a simple yet effective concept selection method. For the embedding approach, we proposed three separate models to learn joint embeddings between video visual features and object counts, activities, and concepts, respectively, and further proposed a fusion scheme to merge the search results of the three models. These approaches were evaluated in the TRECVID Ad-hoc Video Search (AVS) 2019 task and provided interesting insights.
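The sketch below illustrates the two text-based routes in simplified form: selecting concepts from a concept bank for a query and scoring shots by detector confidence, and fusing ranked lists returned by several embedding models. The naive string-matching selection and the reciprocal-rank fusion shown here are common baselines used as stand-ins; the concept selection method and fusion scheme in the thesis may differ.

```python
def select_concepts(query, concept_bank):
    """Pick bank concepts whose names appear in the query (naive matching)."""
    words = set(query.lower().split())
    return [c for c in concept_bank if c.lower() in words]


def concept_score(shot_concepts, selected):
    """Sum detector confidences of the selected concepts for one shot."""
    return sum(shot_concepts.get(c, 0.0) for c in selected)


def fuse_ranked_lists(ranked_lists, k=60):
    """Reciprocal-rank fusion of result lists from separate models:
    a shot ranked highly by several models accumulates a larger score."""
    scores = {}
    for ranked in ranked_lists:
        for rank, shot_id in enumerate(ranked):
            scores[shot_id] = scores.get(shot_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


# Usage: merge the outputs of the three (hypothetical) embedding models.
# fused = fuse_ranked_lists([counting_results, activity_results, concept_results])
```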

Following up on the previous studies, we ran an extensive user study to evaluate the performance difference between automatic search and interactive search. The study focused on text-based retrieval, with the main search module relying on the large concept bank of the Vireo video search system. The user study employed two search modes: searching from scratch and searching by inspecting results provided by automatic search. The results were compared with those of automatic search, including concept-free and concept-based approaches. The interactions between user and system were logged to analyze user behavior and its effect on search results. The comparison shows that automatic search still does not perform as well as interactive search. Likewise, initiating a search from an automatic search result does not benefit the user more than starting from scratch. Moreover, the recorded interactions provide various insights into how users compose queries, browse results, and utilize the different search modules.
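A minimal sketch of the kind of interaction logging such a study relies on is shown below: each query formulation, browsing action, and submission is recorded as a timestamped event for later behavioral analysis. The event schema and field names are assumptions for illustration, not the study's actual log format.

```python
import json
import time


def log_event(log_file, user_id, action, payload):
    """Append one structured interaction event as a JSON line."""
    event = {
        "timestamp": time.time(),
        "user": user_id,
        "action": action,     # e.g. "query", "browse", "submit"
        "payload": payload,   # e.g. query text, page number, shot id
    }
    log_file.write(json.dumps(event) + "\n")


# Usage: record one query formulation during a session.
# with open("session.log", "a") as f:
#     log_event(f, "user01", "query", {"text": "person riding a bicycle"})
```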

The proposed systems and retrieval modules were evaluated on large-scale real-world video datasets, including IACC.3 and V3C1. Most of the evaluations were performed in annual video search benchmarks: the Video Browser Showdown for interactive search and the TRECVID Ad-hoc Video Search task for automatic search. The experimental results of our techniques show positive potential for real-world interactive video search applications.