This thesis investigates the problems of multimedia event detection and recounting using commonsense knowledge, aiming to 1) automate the selection of concepts for representing events of interest, 2) boost event detection performance by exploiting the commonsense ontology, 3) synthesize sentences by leveraging this knowledge, and 4) locate the evidence of target events in videos.
The use of a concept pool is common practice in event detection. However, learning many irrelevant concepts that are not discriminative for the target events demands huge computing resources. In view of this, most existing approaches resolve the problem by hand-crafting the event concepts, which is feasible for a limited number of events but certainly inapplicable to large-scale implementation. We propose an event-specific representation, named the event network, that takes advantage of commonsense knowledge. The concepts derived in the event networks are highly discriminative for the events of interest. In addition, we categorize these concepts so that the event networks can be used for both event detection and recounting.
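To make the idea concrete, the sketch below derives an event-specific concept set by ranking a commonsense concept pool by relatedness to an event query. This is a minimal illustration, not the thesis's actual construction: the `relatedness` callable, the toy pool, and the `top_k` cutoff are all hypothetical.

```python
# A minimal sketch (not the thesis's actual construction) of deriving an
# event-specific concept set from a commonsense concept pool.

def build_event_network(event_query, concept_pool, relatedness, top_k=3):
    """Rank candidate concepts by relatedness to the event query and
    keep the most discriminative ones."""
    scored = sorted(concept_pool,
                    key=lambda c: relatedness(event_query, c),
                    reverse=True)
    return scored[:top_k]

# Toy relatedness table standing in for a commonsense ontology lookup.
sim = {"cake": 0.9, "candle": 0.8, "singing": 0.7,
       "car engine": 0.1, "snowboard": 0.05}
network = build_event_network("birthday party", list(sim),
                              lambda e, c: sim[c])
print(network)  # ['cake', 'candle', 'singing']
```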
Classifiers are learned for the discriminative concepts in the event networks. The confidence scores from concept detection provide information complementary to low-level features and yield better event detection performance.
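The following sketch shows one simple way concept confidences can complement low-level features; the fusion scheme and the weight `alpha` are assumptions for illustration, not the thesis's exact method.

```python
# A minimal sketch of combining concept confidence scores with
# low-level features (assumed fusion scheme, not the thesis's method).

import numpy as np

def fuse(low_level, concept_scores, alpha=0.5):
    """Weighted concatenation of low-level and concept-score features."""
    return np.concatenate([(1.0 - alpha) * low_level,
                           alpha * concept_scores])

low_level = np.array([0.2, 0.7, 0.1])        # e.g. a pooled visual descriptor
concept_scores = np.array([0.9, 0.05, 0.6])  # per-concept classifier confidences
video_repr = fuse(low_level, concept_scores)
print(video_repr.shape)  # (6,) -- fed to an event classifier downstream
```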
Beyond that, concept connectivity is studied and used for context modeling. Concepts of an event tend to co-vary with other concepts, so the contextual associations among concepts should be inferred holistically. We therefore devise a novel energy function that incorporates the co-occurrence information of concepts for context modeling. Our model refines the detection scores of event concepts by considering the co-occurrence of all concepts holistically, with the aim of boosting overall event detection performance.
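A generic formulation of such an energy function, written only to make the idea concrete (the thesis's exact objective may differ), is as follows: s_i is the refined score of concept i, \hat{s}_i its initial detection score, c_{ij} the co-occurrence strength between concepts i and j, and \lambda a trade-off weight.

```latex
% Illustrative context-modeling energy (assumed form, not the thesis's
% exact objective). The unary term anchors refined scores to the
% detections; the pairwise term rewards jointly high scores for
% frequently co-occurring concepts.
E(\mathbf{s}) = \sum_{i} \left( s_i - \hat{s}_i \right)^2
              \;-\; \lambda \sum_{i \neq j} c_{ij}\, s_i\, s_j
```

Minimizing E(\mathbf{s}) holistically raises the scores of concepts that co-occur with other confidently detected concepts while suppressing isolated false alarms.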
For videos deemed positive for an event, the evidence of the target event is then recounted. Instead of attaching concept names as tags to video snippets, the event network is further exploited for event recounting. An automatic sentence synthesis framework is designed to recount the concepts that appear in the videos. Sentences are generated by making use of the commonsense knowledge in the event networks; sentences complying with grammatical structure are dynamically synthesized from the detected concepts. We describe the process from part-of-speech tagging to parse tree formation, and from phrase generation to full sentence composition. In addition, we pilot and showcase the use of commonsense knowledge to compose sentences with implied meaning for event recounting.
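The sketch below illustrates only the final phrase-to-sentence composition step; the thesis builds full parse trees, and the detected concepts and their part-of-speech roles shown here are hypothetical.

```python
# A minimal sketch of phrase-to-sentence composition (illustrative
# only; the thesis derives full parse trees from detected concepts).

def noun_phrase(noun, determiner="a"):
    return f"{determiner} {noun}"

def verb_phrase(verb, object_np):
    # Present progressive, matching an action observed in the video.
    return f"is {verb} {object_np}"

def compose(subject_np, vp):
    text = f"{subject_np} {vp}."
    return text[0].upper() + text[1:]

# Hypothetical concept detections for a birthday-party clip.
sentence = compose(noun_phrase("man"),
                   verb_phrase("blowing", noun_phrase("candle")))
print(sentence)  # "A man is blowing a candle."
```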
It is intuitive to believe that the evidence of a target event resides in the clips with the highest average concept detection scores, yet recounting such clips as the evidence of an event is in fact insufficient. We investigate the factors affecting evidence localization from various aspects jointly. A novel ranking model is proposed with the aim of assigning higher scores to the video segments believed to contain the evidence than to background segments. We present a breakdown of each setting to show the importance of each aspect in localizing the evidence in videos.
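A pairwise ranking objective of the kind sketched below captures the core requirement that evidence segments outrank background segments; this is an illustration under assumptions (a linear scorer `w` and a hinge margin), not the thesis's full model, which weighs several aspects jointly.

```python
# A sketch of a pairwise hinge ranking objective for evidence
# localization (assumed linear scorer; not the thesis's full model).

import numpy as np

def hinge_ranking_loss(w, evidence, background, margin=1.0):
    """Sum of margin violations over evidence/background segment pairs;
    zero when every evidence segment outranks every background one."""
    return sum(max(0.0, margin - (w @ e - w @ b))
               for e in evidence for b in background)

rng = np.random.default_rng(0)
w = rng.normal(size=4)                                   # segment scorer
evidence = [rng.normal(loc=1.0, size=4) for _ in range(3)]
background = [rng.normal(loc=-1.0, size=4) for _ in range(3)]
print(hinge_ranking_loss(w, evidence, background))
```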
We evaluate all the proposed methods on two large-scale benchmark datasets. Experimental evaluations demonstrate the promising results of our methods in event detection and recounting using commonsense knowledge.
Title: Video content recounting by commonsense knowledge
Author: TAN, C. C.
Student thesis: Doctoral Thesis
Date of Award: 16 Feb 2015
Original language: English
Awarding Institution: City University of Hong Kong
Supervisor: Chong Wah NGO (Supervisor)

Keywords:
- Multimedia systems
- Knowledge representation (Information theory)
- Digital video
- Information storage and retrieval systems