With the massive growth of Internet videos, intensive research efforts have been devoted
to concept annotation, particularly the learning of audio-visual classifiers for annotating
video archives with textual words (or semantic concepts). Based on current
technologies, indexing a video archive with hundreds of elementary concepts for text-to-video
search in a narrow domain (e.g., broadcast videos) and with some restrictions
(e.g., only handling queries whose text words fall within the concept vocabulary) is
feasible, as evidenced by the annual TRECVID benchmark evaluation.

Despite this progress, querying videos with text words beyond concept atoms, for example
an event such as making a cake, in a large Internet video archive remains a difficult problem.
The challenge comes from the fact that the event is both generic, a general description that
can refer to a wide variety of cakes and ways of making them, and complex, involving
activities such as flouring, baking, and freezing. Training an event-specific classifier for
“making a cake” is difficult unless abundant training examples adequately cover all
cases; more importantly, such an approach does not scale, given that Internet videos
can contain virtually any scene or event. Since most Internet videos are uploaded together
with amateur tags, one might expect answering event-oriented queries to be easier.
However, amateur tags tend to be error-prone and unspecific. Searching for
generic events based on textual tags is known to be of limited effectiveness: it becomes
the user's responsibility to explore the long list of returned videos to find the right
hits.

This project aims to address two challenges: event detection, which identifies videos and
localizes the segments containing events previously unknown to a search
system; and event recounting, which narrates the audio-visual evidence of how a video relates to
an event by generating short textual descriptions with illustrative thumbnails. The
former addresses the issue of modeling and reasoning about event knowledge from a large
number of noisily tagged concepts, while the latter explicates that reasoning process in
textual sentences. The major goal is to research techniques that enable the search of
complex, generic events beyond what current concept-classifier learning can
handle, and that recount the reasoning process for fast video browsing beyond what
current video summarization techniques can offer. Both issues are of great value to
Internet video search and content monitoring.
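To make the concept-based reasoning concrete, the sketch below shows one simple way a search system might score and rank videos for a generic event query by aggregating the confidences of pre-trained concept classifiers. It is a minimal illustration only: the concept names, the hand-built query-to-concept mapping, and the weighted-sum aggregation are assumptions for exposition, not the specific method proposed in this project.

    # Minimal sketch (hypothetical): ranking videos for a generic event query
    # by aggregating the outputs of pre-trained concept classifiers.
    from typing import Dict, List

    def score_event(concept_scores: Dict[str, float],
                    relevant_concepts: Dict[str, float]) -> float:
        """Weighted sum of a video's classifier confidences over event-relevant concepts."""
        return sum(weight * concept_scores.get(concept, 0.0)
                   for concept, weight in relevant_concepts.items())

    def rank_videos(archive: Dict[str, Dict[str, float]],
                    relevant_concepts: Dict[str, float],
                    top_k: int = 10) -> List[str]:
        """Rank archive videos (id -> per-concept confidences) by aggregated event score."""
        ranked = sorted(archive,
                        key=lambda vid: score_event(archive[vid], relevant_concepts),
                        reverse=True)
        return ranked[:top_k]

    if __name__ == "__main__":
        # Toy query "making a cake" mapped by hand to a few concept atoms (illustrative only).
        query_concepts = {"kitchen": 0.5, "oven": 1.0, "batter": 1.0, "person_mixing": 0.8}
        toy_archive = {
            "vid_001": {"kitchen": 0.9, "oven": 0.7, "batter": 0.6, "person_mixing": 0.8},
            "vid_002": {"beach": 0.9, "dog": 0.8},
        }
        print(rank_videos(toy_archive, query_concepts))

Even such a simple aggregation highlights the two research issues above: the mapping from an open-ended event query to noisy concept scores must be learned rather than hand-specified, and the concepts that contribute most to a video's score are natural raw material for recounting why the video was retrieved.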