VIREO-EURECOM @ TRECVID 2019 : Ad-hoc Video Search (AVS)
Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review
Author(s)
Nguyen, Phuong Anh; Wu, Jiaxin; Ngo, Chong-Wah et al.
Detail(s)
Original language | English
---|---
Title of host publication | Proceedings of TRECVID 2019
Publisher | National Institute of Standards and Technology (NIST)
Publication status | Published - Nov 2019
Publication series
Name | TREC Video Retrieval Evaluation, TRECVID
---|---
Conference
Title | 2019 TREC Video Retrieval Evaluation, TRECVID 2019
---|---
Place | United States
City | Gaithersburg
Period | 12 - 13 November 2019
Abstract
In this paper, we describe the systems developed for the Ad-hoc Video Search (AVS) task at TRECVID 2019 [1] and the results achieved.
Ad-Hoc Video Search (AVS): We merge three video search systems for AVS: two concept-based video search systems, which analyse the query using linguistic approaches and then select and fuse concepts, and a video retrieval model, which learns a joint embedding space of textual queries and videos for matching. With this setting, we aim to analyze the advantages and shortcomings of these video search approaches. We submit seven runs in total, consisting of four automatic runs, two manual runs, and one novelty run. We briefly describe our runs as follows:
• F_M_C_D_VIREO.19_1: This automatic run achieves mean xinfAP=0.034 using a concept-based video search system with a bank of ∼16.6k concepts covering objects, persons, activities, and places. We parse the queries with the Stanford NLP parsing tool [2], keep the keywords, and categorize them into three groups: object/person, action, and place. Concepts from the corresponding groups in the concept bank are then selected and fused (a parsing sketch follows this list).
• F_M_C_D_VIREO.19_2: This automatic run achieves mean xinfAP=0.067 using a second concept-based video search system with ∼16.4k concepts. The concept bank differs slightly from the one used in F_M_C_D_VIREO.19_1. From the query, we embed the words, terms, and the whole query with the Universal Sentence Embedding [3]; all concept names in the concept bank are embedded in the same way. Finally, concepts are selected by an incremental concept selection method [4] based on the cosine similarity between the embedded query and the embedded concept names (see the embedding sketch after this list).
• F_M_C_D_VIREO.19_3: This run fuses the results of three automatic runs: F_M_C_D_VIREO.19_1, F_M_C_D_VIREO.19_2, and F_M_C_A_EURECOM.19_1 (a fusion sketch follows this list). In the run F_M_C_A_EURECOM.19_1, three embedding spaces are learnt separately for object counting, activity detection, and semantic concept annotation; the textual query feature and the visual video feature are mapped into these three embedding spaces and fused for matching. The run achieves mean xinfAP=0.060.
• F_M_C_D_VIREO.19_4: This run fuses the results of the three automatic runs mentioned in F_M_C_D_VIREO.19_3 with the result of a metadata-based retrieval system. To enable metadata search, we index all video metadata with Lucene, and retrieval is done at the video level. The performance stays at mean xinfAP=0.060.
• M_M_C_D_VIREO.19_1: This manual run uses the same system and settings as F_M_C_D_VIREO.19_1, except that the user parses and categorizes the query manually at the beginning of the process. This human intervention improves the mean xinfAP from 0.034 to 0.066.
• M_M_C_D_VIREO.19_2: This manual run uses the same system and settings as F_M_C_D_VIREO.19_2. After obtaining the list of selected concepts for each query, the user screens the list and removes unrelated or unspecific concepts to refine the result. This step improves the mean xinfAP from 0.067 to 0.118.
• F_M_N_D_VIREO.19_5: This is the novelty run with mean xinfAP=0.075, the best automatic run from the VIREO team. The query-processing system uses the same settings as F_M_C_D_VIREO.19_2, except that we only use the embedding of the whole query sentence for concept selection.
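The first concept-based system (F_M_C_D_VIREO.19_1) parses queries and routes keywords into object/person, action, and place groups. The paper cites the Stanford NLP parsing tool [2] but gives no implementation details; the sketch below is a hypothetical approximation using the Stanford `stanza` Python library, where the `PLACE_WORDS` lexicon and the part-of-speech routing rules are our own assumptions.

```python
# A minimal, hypothetical sketch of query parsing and keyword grouping,
# in the spirit of run F_M_C_D_VIREO.19_1. It is NOT the authors' exact
# pipeline; the place lexicon and routing rules below are assumptions.
# (Requires: pip install stanza; run stanza.download("en") once.)
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma")

# Tiny assumed lexicon for the "place" group; a real system would be far richer.
PLACE_WORDS = {"kitchen", "street", "beach", "office", "park"}

def categorize_query(query: str) -> dict:
    """Split a textual query into object/person, action, and place keywords."""
    groups = {"object_person": [], "action": [], "place": []}
    doc = nlp(query)
    for sentence in doc.sentences:
        for word in sentence.words:
            lemma = word.lemma.lower()
            if word.upos == "VERB":
                groups["action"].append(lemma)        # actions -> activity concepts
            elif word.upos in ("NOUN", "PROPN"):
                if lemma in PLACE_WORDS:
                    groups["place"].append(lemma)     # scene/place concepts
                else:
                    groups["object_person"].append(lemma)
    return groups

print(categorize_query("a man riding a bicycle on the street"))
# e.g. {'object_person': ['man', 'bicycle'], 'action': ['ride'], 'place': ['street']}
```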
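The second system (F_M_C_D_VIREO.19_2) embeds the query and all concept names with the Universal Sentence Embedding [3] and compares them by cosine similarity. The sketch below assumes the TF-Hub release of the Universal Sentence Encoder and a toy `concept_bank`; the incremental selection method of [4] is simplified here to a plain top-k cut-off.

```python
# A minimal sketch of similarity-based concept selection, in the spirit of
# run F_M_C_D_VIREO.19_2. The concept bank is a toy stand-in, and the
# incremental selection of [4] is simplified to a plain top-k ranking.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

concept_bank = ["bicycle", "person riding", "street", "dog", "swimming pool"]

def select_concepts(query: str, k: int = 3):
    """Rank concept names by cosine similarity to the embedded query."""
    vecs = embed([query] + concept_bank).numpy()
    q, c = vecs[0], vecs[1:]
    sims = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:k]
    return [(concept_bank[i], float(sims[i])) for i in order]

print(select_concepts("a man riding a bicycle on the street"))
```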
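Runs F_M_C_D_VIREO.19_3 and F_M_C_D_VIREO.19_4 fuse the ranked lists of several systems. The paper does not state its fusion formula; averaging min-max normalized scores per shot, as sketched below, is one common choice and is shown purely for illustration.

```python
# A hypothetical fusion sketch for runs F_M_C_D_VIREO.19_3/19_4. The paper
# does not specify its fusion formula; averaging min-max normalized scores
# is one common choice, used here purely for illustration.
def normalize(scores: dict) -> dict:
    """Min-max normalize one system's shot scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {shot: (s - lo) / span for shot, s in scores.items()}

def fuse(runs: list) -> list:
    """Average normalized scores per shot across runs; missing shots score 0."""
    runs = [normalize(r) for r in runs]
    shots = set().union(*runs)
    fused = {s: sum(r.get(s, 0.0) for r in runs) / len(runs) for s in shots}
    return sorted(fused.items(), key=lambda kv: -kv[1])

run_a = {"shot1": 0.9, "shot2": 0.4, "shot3": 0.1}
run_b = {"shot2": 0.8, "shot3": 0.6}
print(fuse([run_a, run_b]))  # fused ranking, best shot first
```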
Citation Format(s)
VIREO-EURECOM @ TRECVID 2019: Ad-hoc Video Search (AVS). / Nguyen, Phuong Anh; Wu, Jiaxin; Ngo, Chong-Wah et al.
Proceedings of TRECVID 2019. National Institute of Standards and Technology (NIST), 2019. (TREC Video Retrieval Evaluation, TRECVID).