Deep Fine-grained Video Representation for Large-Scale Multimedia Event Retrieval and Recounting

Student thesis: Doctoral Thesis

Abstract

This thesis focuses on learning fine-grained semantic video representations for the multimedia event detection (MED) and recounting (MER) tasks. Given a target event, MED aims to retrieve event-relevant videos, and MER explains the retrieval results by promoting thumbnails as event evidence. We present three fine-grained, object-level video representations to address three critical issues: (1) how to exploit fine-grained regional information and concept vectors in video representation; (2) how to incorporate unsupervised object-level information with general deep convolutional feature maps for video retrieval; and (3) how to utilize supervised object-level information to reduce the computational cost of constructing video representations. For the first issue, we verify the necessity of considering fine-grained regional objects when constructing semantic video representations. For the second issue, we show that this necessity also extends to general video representations constructed by encoding deep features from the middle layers of a deep convolutional neural network (DCNN), such as fully connected and convolutional layers. For the last issue, we aim to improve the computational efficiency of constructing video representations that consider fine-grained regional objects. We briefly elaborate the proposed frameworks for these issues in the following.

Unlike action and sports videos, which exhibit distinctive motion patterns of short duration, videos relevant to a multimedia event are complex in motion patterns and object interactions and share large visual diversity. Therefore, a higher-level feature representation reflecting semantic content or object compositions beyond motion is more feasible for depicting the rich and diverse content of a multimedia event. Common practice represents video frames with semantic tags or features yielded by a DCNN, and then pools or encodes the frame-level features into a video-level representation. However, due to the max-pooling and softmax layers in the DCNN architecture, concept detectors tend to overlook activations of small objects, emphasize those of primary objects, and produce a very sparse semantic vector, resulting in the loss of regional object information. To tackle this problem, we propose a fine-grained semantic representation named Object-Pooling, which dynamically extracts visual snippets corresponding to when and where evidence might appear. The main idea of Object-Pooling is to adaptively sample regions from frames to generate an object histogram that can be efficiently rolled up and back. The Object-Pooling video representation demonstrates its effectiveness for both event detection and evidence localization.
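To make the roll-up-and-back mechanism concrete, below is a minimal sketch in Python of per-concept max pooling over sampled regions, with an index table that lets each histogram bin be rolled back to its source frame and bounding box. The function name, the choice of max pooling as the aggregator, and the toy data are assumptions for illustration, not the thesis's exact implementation.

```python
import numpy as np

def object_pooling(region_scores, frame_ids, region_boxes):
    """Aggregate per-region concept scores into a video-level object
    histogram, keeping argmax indices so each concept activation can be
    rolled back to the frame and region that produced it.

    region_scores: (R, C) concept probabilities per sampled region
    frame_ids:     (R,)   frame index of each region
    region_boxes:  (R, 4) bounding box of each region
    """
    histogram = region_scores.max(axis=0)    # per-concept max pooling
    evidence = region_scores.argmax(axis=0)  # region supplying each max
    # Roll-back table: for each concept, the (frame, box) evidence
    rollback = [(int(frame_ids[r]), region_boxes[r].tolist())
                for r in evidence]
    return histogram, rollback

# Toy usage: 5 candidate regions over 2 frames, 4 concept categories.
rng = np.random.default_rng(0)
scores = rng.random((5, 4))
frames = np.array([0, 0, 0, 1, 1])
boxes = rng.integers(0, 100, size=(5, 4))
hist, evid = object_pooling(scores, frames, boxes)
print(hist)  # video-level object histogram
print(evid)  # (frame, box) evidence per concept
```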

Despite the encouraging performance gained by Object-Pooling, the concept predictions from the output layer still suffer from the flaw of the softmax layer. In particular, ineffective object predictions accumulate when an object proposal fails to be classified into a valid concept category. Utilizing mid-layer convolutional features bypasses this side effect of the softmax layer; however, the spatio-temporal pooling strategy adopted in Object-Pooling is not feasible for processing three-dimensional convolutional feature maps. We therefore propose Object-VLAD, which adopts the vector of locally aggregated descriptors (VLAD) encoding to incorporate deep convolutional features and fine-grained object information when representing object compositions in videos. Specifically, Object-VLAD samples frames and object candidates, representing the primary and secondary object information in a video, respectively. Each is first described with DCNN mid-layer features and then encoded with VLAD into a compact video descriptor for MED. During the decoding stage, the descriptors of retrieved videos are unrolled to locate frames and objects as evidence for MER.
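For reference, here is a minimal sketch of standard VLAD encoding as it could be applied to the mid-layer features of frames and object candidates. The codebook is assumed to come from k-means, and the power and L2 normalizations are the common choices in the VLAD literature rather than details confirmed by the abstract.

```python
import numpy as np

def vlad_encode(descriptors, centroids):
    """Standard VLAD: sum the residuals of descriptors to their nearest
    codebook centroid, then power- and L2-normalize.

    descriptors: (N, D) DCNN mid-layer features of frames/object regions
    centroids:   (K, D) visual codebook (e.g., learned by k-means)
    returns:     (K*D,) compact video-level descriptor
    """
    K, D = centroids.shape
    # Hard-assign each descriptor to its nearest centroid
    assign = np.argmin(
        ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1),
        axis=1)
    vlad = np.zeros((K, D))
    for k in range(K):
        if np.any(assign == k):
            vlad[k] = (descriptors[assign == k] - centroids[k]).sum(axis=0)
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))  # power normalization
    return vlad / (np.linalg.norm(vlad) + 1e-12)  # global L2 normalization
```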

Both Object-Pooling and Object-VLAD adopt unsupervised region proposal algorithms, such as selective search, to generate object candidates, which are then fed into a DCNN for feature extraction. This results in repeated convolutional computation across object regions within the same frame. To share convolutional computation among object candidates and reduce computational cost, we propose Object-Det-VLAD, which incorporates an object detection network with VLAD quantization for video representation. Specifically, we adopt a region-based fully convolutional network (R-FCN) for both region-of-interest (RoI) proposal and RoI feature extraction. Deep features of the same video are then quantized by VLAD into a compact video representation. Since R-FCN incorporates RoI proposal in the learning phase, the proposed RoIs and their activations are biased toward the object categories in the training set. As a result, moving from Object-VLAD to Object-Det-VLAD sacrifices a small amount of event retrieval performance for computational speed.
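The computational saving can be sketched as follows: one forward pass per frame yields features for every RoI at once, which are then quantized with the `vlad_encode` function from the previous sketch. The `detector` wrapper below is hypothetical and stands in for an R-FCN-style network; it is not an actual library API.

```python
import numpy as np

def object_det_vlad(frames, detector, centroids):
    """Convolutions are shared across proposals: each call to
    detector(frame) -> (roi_boxes, roi_features) runs the backbone
    once per frame instead of once per object candidate.
    """
    roi_features = [detector(frame)[1] for frame in frames]
    # Quantize all RoI features of the video into one compact vector
    return vlad_encode(np.vstack(roi_features), centroids)
```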

Evaluations of the proposed techniques are conducted on a large-scale video retrieval benchmark, TRECVID Multimedia Event Detection (MED), from years 2010 to 2017. The techniques demonstrate promising results, verifying the potential and necessity of encapsulating fine-grained, object-level video content for effective multimedia event detection and recounting.
Date of Award: 22 Dec 2017
Original language: English
Awarding Institution:
  • City University of Hong Kong
Supervisor: Chong Wah NGO
