Video content recounting by commonsense knowledge

  • Chun Chet TAN

Student thesis: Doctoral Thesis

Abstract

This thesis investigates the problems of multimedia event detection and recounting using commonsense knowledge, aiming to 1) automate the selection of concepts for representing events of interest, 2) boost event detection performance by exploiting a commonsense ontology, 3) synthesize sentences by leveraging that knowledge, and 4) locate the evidence of target events in videos.

Concept pools are commonly used in practice for event detection. However, learning many irrelevant concepts that are not discriminative for the target events wastes considerable computing resources. Most existing approaches therefore resolve the problem by hand-crafting the event concepts, which is feasible for a limited number of events but inapplicable at large scale. We propose an event-specific representation, named the event network, that takes advantage of commonsense knowledge: the concepts derived in an event network are highly discriminative for the event of interest. We further categorize these concepts so that the event network can serve both event detection and recounting. Classifiers are learned for the discriminative concepts in the event networks, and the confidence scores from concept detection provide information complementary to low-level features, yielding better event detection performance.

Concept connectivity is also studied and used for context modeling. Concepts of an event tend to covary with other concepts, so contextual associations among concepts should be inferred holistically. We devise a novel energy function that incorporates the co-occurrence information of concepts; by considering the co-occurrence of all concepts holistically, the model refines the detection scores of event concepts and boosts overall event detection performance.

For videos deemed positive for an event, evidence of the target event is then recounted. Instead of attaching concept names as tags to video snippets, the event network is further exploited for event recounting: an automatic sentence synthesis framework is designed to recount the concepts that appear in the videos. Grammatical sentences are dynamically synthesized from the detected concepts by making use of the commonsense knowledge in the event networks. We present the process from part-of-speech tagging to parse tree formation, and from phrase generation to full sentence composition. We also pilot and showcase the use of commonsense knowledge to compose sentences with implied meanings for event recounting.

It is intuitive to assume that the evidence of a target event resides in the clips with the highest average concept detection scores, but recounting only such clips is insufficient. We jointly investigate the factors that affect evidence localization and propose a novel ranking model that aims to assign higher scores to the video segments believed to contain evidence than to background segments. We report a breakdown of each setting, showing the importance of each factor for localizing evidence in videos.

All proposed methods are evaluated on two large-scale benchmark datasets. Experimental evaluations demonstrate promising results for event detection and recounting using commonsense knowledge.
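The abstract names a novel energy function for co-occurrence-based context modeling without giving its form. As a minimal sketch of the idea only, not the thesis's actual formulation, assume initial detector scores r and a concept co-occurrence matrix C estimated from training annotations, and refine the scores by minimizing a quadratic energy that trades fidelity to r against agreement among co-occurring concepts. The function name, the λ weight, and the gradient-descent settings below are all illustrative assumptions.

```python
import numpy as np

def refine_scores(r, C, lam=0.1, lr=0.05, steps=200):
    """Refine concept detection scores with co-occurrence context.

    Minimizes E(s) = ||s - r||^2 - lam * s^T C s by projected
    gradient descent, keeping scores in [0, 1]. A toy stand-in
    for the thesis's (unspecified) energy function.
    """
    s = r.copy()
    for _ in range(steps):
        grad = 2.0 * (s - r) - lam * (C + C.T) @ s
        s = np.clip(s - lr * grad, 0.0, 1.0)
    return s

# Toy example: concepts 0 and 1 co-occur, so a weak score on
# concept 1 is pulled up when concept 0 fires strongly.
r = np.array([0.9, 0.2, 0.1])          # initial detector scores
C = np.array([[0.0, 1.0, 0.0],         # symmetric co-occurrence counts
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]], dtype=float)
print(refine_scores(r, C))
```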
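The recounting pipeline goes from part-of-speech tags to parse trees, then from phrases to full sentences. The following toy illustrates only the final phrase-to-sentence step under strong assumptions: each detected concept maps to a hand-made (subject, verb phrase) entry standing in for the commonsense relations the thesis derives from its ontology. The table contents and the `recount` helper are hypothetical.

```python
# Hypothetical commonsense table: concept -> (subject NP, VP).
# The thesis derives such relations from a commonsense ontology;
# these entries are made up for illustration.
COMMONSENSE = {
    "blow_candle": ("a person", "blows out the candles"),
    "birthday_cake": ("a birthday cake", "sits on the table"),
}

def recount(detected, threshold=0.5):
    """Compose simple subject-verb sentences for confident concepts."""
    sentences = []
    for concept, score in sorted(detected.items(), key=lambda kv: -kv[1]):
        if score < threshold or concept not in COMMONSENSE:
            continue  # skip weak detections and unknown concepts
        subj, vp = COMMONSENSE[concept]
        sentences.append(f"{subj.capitalize()} {vp}.")
    return " ".join(sentences)

print(recount({"blow_candle": 0.8, "birthday_cake": 0.7, "dog": 0.9}))
# -> "A person blows out the candles. A birthday cake sits on the table."
```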
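Finally, the abstract argues that ranking segments by average concept score alone is insufficient and that several factors must be weighed jointly. As a hedged sketch of that intuition, score each segment by a weighted combination of mean confidence, peak confidence, and concept diversity; the choice of factors and the weights here are assumptions, not the thesis's learned ranking model.

```python
import numpy as np

def rank_segments(seg_scores, w=(0.5, 0.3, 0.2)):
    """Rank video segments as candidate event evidence.

    seg_scores: (num_segments, num_concepts) concept confidences.
    Combines mean confidence, peak confidence, and diversity
    (fraction of concepts firing above 0.5). Weights are illustrative;
    the thesis learns a ranking model over such factors jointly.
    """
    mean_c = seg_scores.mean(axis=1)
    peak_c = seg_scores.max(axis=1)
    diversity = (seg_scores > 0.5).mean(axis=1)
    score = w[0] * mean_c + w[1] * peak_c + w[2] * diversity
    return np.argsort(-score), score

scores = np.array([[0.9, 0.1, 0.1],    # one strong concept
                   [0.6, 0.6, 0.6],    # several moderate concepts
                   [0.2, 0.2, 0.2]])   # background
order, s = rank_segments(scores)
print(order)   # evidence-bearing segments ranked first
```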
Date of Award: 16 Feb 2015
Original language: English
Awarding Institution:
  • City University of Hong Kong
Supervisor: Chong Wah NGO

Keywords

  • Multimedia systems
  • Knowledge representation (Information theory)
  • Digital video
  • Information storage and retrieval systems
