Towards textually describing complex video contents with audio-visual concept classifiers

Chun Chet Tan, Yu-Gang Jiang, Chong-Wah Ngo

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

32 Citations (Scopus)

Abstract

Automatically generating compact textual descriptions of complex video contents has wide applications. With the recent advancements in automatic audio-visual content recognition, in this paper we explore the technical feasibility of the challenging task of precisely recounting video contents. Using cutting-edge automatic recognition techniques, we start by classifying a variety of visual and audio concepts in video contents. Based on the classification results, we apply simple rule-based methods to generate textual descriptions of video contents. Results are evaluated through carefully designed user studies. We find that state-of-the-art visual and audio concept classification, although far from perfect, provides very useful clues about what is happening in the videos. Most users involved in the evaluation confirmed the informativeness of our machine-generated descriptions. © 2011 ACM.
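The pipeline the abstract describes — concept classifiers emitting confidence scores, followed by simple rules that turn high-scoring concepts into a sentence — can be sketched as follows. This is a minimal illustration, not the paper's actual method; the function name, threshold, and templates are assumptions.

```python
def describe(visual_scores, audio_scores, threshold=0.5):
    """Rule-based sketch: turn concept classifier scores into a description.

    visual_scores / audio_scores: dicts mapping concept name -> confidence
    (hypothetical classifier outputs). Concepts at or above `threshold` are
    kept, ordered by descending confidence, and slotted into fixed templates.
    """
    visual = [c for c, s in sorted(visual_scores.items(), key=lambda x: -x[1])
              if s >= threshold]
    audio = [c for c, s in sorted(audio_scores.items(), key=lambda x: -x[1])
             if s >= threshold]
    parts = []
    if visual:
        parts.append("the video shows " + ", ".join(visual))
    if audio:
        parts.append("with sounds of " + ", ".join(audio))
    if not parts:
        return "No confident concepts detected."
    sentence = "; ".join(parts) + "."
    return sentence[0].upper() + sentence[1:]
```

For example, scores of {"dog": 0.9, "park": 0.7, "car": 0.2} for visual concepts and {"barking": 0.8} for audio would yield "The video shows dog, park; with sounds of barking." — the low-confidence "car" is filtered out by the threshold rule.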
Original language: English
Title of host publication: MM'11 - Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops
Pages: 655-658
DOIs
Publication status: Published - 2011
Event: 19th ACM International Conference on Multimedia (ACM Multimedia Conference 2011) - Scottsdale, United States
Duration: 28 Nov 2011 – 1 Dec 2011
Conference number: 19
https://dl.acm.org/doi/proceedings/10.1145/2072298

Conference

Conference: 19th ACM International Conference on Multimedia (ACM Multimedia Conference 2011)
Abbreviated title: MM'11
Place: United States
City: Scottsdale
Period: 28/11/11 – 1/12/11

Research Keywords

  • Audio-visual concept classification
  • Textual descriptions of video content
