Zero-Example Multimedia Search Using Large-Scale Knowledge Bank


Student thesis: Doctoral Thesis





Award date: 16 May 2018


This thesis investigates general-purpose multimedia search that requires no training examples. We tackle the problem by (1) building a framework that performs the search with a large-scale pre-trained knowledge bank, (2) examining the factors that contribute to the accuracy and robustness of the system, as well as how to exploit problem-specific knowledge structure to improve performance, and (3) integrating human knowledge in the loop for interactive search, making the search system adaptable to general-purpose user queries.

We focus on detecting and searching for complex multimedia events. A common practice for this task is to build a pool of concept detectors covering a wide range of concepts as prior knowledge. We present a framework that handles a mix of more than ten thousand concepts with broad coverage of people, common objects, actions, scenes, and activities. After optimization, the system can search one million videos within seconds on a desktop PC. The query-to-concept matching model in the framework is designed to integrate flexibly with multiple similarity measurement tuners, such as TF-IDF, word propagation, and word specificity.
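To make the idea of query-to-concept matching concrete, the following is a minimal sketch, not the thesis implementation. It assumes concepts are identified by short text names and scores each concept by the IDF-weighted overlap between its name and the query, normalized by the concept's own weight. The function names (`tokenize`, `idf_weights`, `match_concepts`), the smoothing formula, and the toy concept bank are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; keep alphabetic tokens only."""
    return [w for w in text.lower().replace("_", " ").split() if w.isalpha()]


def idf_weights(concept_names):
    """Smoothed inverse document frequency over concept names (assumed form)."""
    n = len(concept_names)
    df = Counter()
    for name in concept_names:
        df.update(set(tokenize(name)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def match_concepts(query, concept_names, top_k=3):
    """Rank concepts by IDF-weighted word overlap with the query.

    The score is the dot product of the query and concept word sets,
    weighted by IDF and divided by the concept vector's norm. The query
    norm is constant across concepts, so it is omitted.
    """
    idf = idf_weights(concept_names)
    q_words = set(tokenize(query))
    scores = {}
    for name in concept_names:
        c_words = set(tokenize(name))
        shared = q_words & c_words
        if not shared:
            continue  # no lexical overlap, concept is not selected
        dot = sum(idf[w] for w in shared)
        norm = math.sqrt(sum(idf[w] ** 2 for w in c_words))
        scores[name] = dot / norm
    # Sort by descending score, then by name for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy concept bank (hypothetical names, for illustration only).
bank = ["dog", "dog show", "birthday cake", "person blowing candles", "car"]
ranked = match_concepts("birthday party with cake and candles", bank)
```

In this sketch only exact word overlap selects a concept. Tuners such as word propagation or word specificity would change how the per-word weights are computed, which is why the weighting is kept in a separate function.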

For a search system based on domain-unspecific knowledge, we investigate its performance along two dimensions: accuracy and robustness. First, the size of the concept bank relative to the information need of a query has the foremost impact. When the concept bank is small, it may miss the information need. In this case, concept propagation is an intuitive way to discover more related concepts. However, we investigate why concept propagation does not, in fact, work well for our problem. In particular, we find that on complex queries the propagation is more prone to introducing irrelevant concepts. An alternative way to address the missing information is to increase the size of the concept bank. We discuss the challenges of augmenting the pool and propose to exploit the larger bank by selecting the right concepts and suppressing the wrong ones during concept matching. For robustness, on the other hand, we propose a novel evidence pre-localization module. The module extends the original framework and narrows the search down to the video segments that are most likely to be correlated with the query. It not only provides noise resistance that largely offsets the downside of a larger number of concepts, but also offers an interpretation of the search result as "evidence", so that a user can quickly screen a result through the evidence instead of watching the whole video.
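The idea of narrowing a search down to the most query-relevant segments can be sketched as follows. This is an illustrative assumption about how such a module might score videos, not the method proposed in the thesis: each segment gets a weighted sum of its concept detector responses, the top-scoring segments are kept as evidence, and the video score is averaged over those segments only. The names `segment_relevance` and `localize_evidence`, the `top_k` parameter, and the example scores are hypothetical.

```python
def segment_relevance(segment_scores, concept_weights):
    """Weighted sum of detector responses for each segment.

    segment_scores: list of dicts mapping concept name -> detector score.
    concept_weights: dict mapping selected concept name -> query weight.
    """
    return [
        sum(w * seg.get(c, 0.0) for c, w in concept_weights.items())
        for seg in segment_scores
    ]


def localize_evidence(segment_scores, concept_weights, top_k=2):
    """Return (video_score, evidence_segment_indices).

    Only the top_k most relevant segments contribute to the video score,
    so low-scoring segments do not dilute it.
    """
    rel = segment_relevance(segment_scores, concept_weights)
    # Rank segment indices by descending relevance, ties broken by index.
    ranked = sorted(range(len(rel)), key=lambda i: (-rel[i], i))
    evidence = ranked[:top_k]
    if not evidence:
        return 0.0, []
    video_score = sum(rel[i] for i in evidence) / len(evidence)
    return video_score, evidence


# Hypothetical detector outputs for a four-segment video.
segments = [
    {"cake": 0.1, "candles": 0.0},
    {"cake": 0.9, "candles": 0.8},
    {"cake": 0.2, "candles": 0.1},
    {"cake": 0.7, "candles": 0.6},
]
weights = {"cake": 1.0, "candles": 0.5}
score, evidence = localize_evidence(segments, weights, top_k=2)
```

The returned segment indices are what a user interface could show as evidence, letting a user jump to those segments rather than watch the whole video.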

To further improve the overall retrieval performance, we bring human knowledge into the search pipeline through user interaction. We facilitate concept screening and incorporate several video reranking techniques for interactive search. As a consequence, a user can improve a search result either with or without seeing the initial result. Evidence localization also improves the user experience by showing the exact locations of interest in a video, enabling a user to go through several search results in a short time.

All the proposed techniques are evaluated on large-scale video search benchmarks provided by the National Institute of Standards and Technology (NIST). The evaluation demonstrates promising results. We achieved the best performance among leading international groups in the NIST TRECVID blind tests in both 2015 and 2016. The optimized implementation, which completes a search within seconds, shows good potential for real-world multimedia applications.