Unsupervised approximate-semantic vocabulary learning for human action and video classification

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

9 Scopus Citations

Detail(s)

Original language: English
Pages (from-to): 1870-1878
Journal / Publication: Pattern Recognition Letters
Volume: 34
Issue number: 15
Publication status: Published - 2013

Abstract

The paper presents a novel unsupervised contextual spectral embedding (CSE) framework for human action and video classification. As with textual words, the visual-word (mid-level semantic) representation of an image or video contains synonymous words, which make the representation ambiguous. To narrow the semantic gap between visual words (the mid-level semantic representation) and high-level semantics, we propose a high-level representation called the approximate-semantic descriptor. The experimental results show that the proposed approach to visual-word disambiguation improves subsequent classification performance. In the paper, approximate-semantic descriptor learning is formulated as a spectral clustering problem, such that semantically associated visual words are placed close together in a low-dimensional semantic space and then clustered into one approximate-semantic descriptor. Specifically, the high-level representation of human action videos is learnt by capturing the inter-video context of mid-level semantics via a non-parametric correlation measure. Experiments on four standard datasets demonstrate that our approach achieves significantly improved results with respect to the state of the art, particularly in unconstrained environments. © 2013 Elsevier B.V. All rights reserved.
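
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the correlation measure is the Pearson product moment correlation (listed under the research areas), that negative correlations are discarded when building the affinity, and that the clustering step is standard normalized spectral clustering followed by k-means. The function name `approximate_semantic_descriptors` and its parameters are hypothetical.

```python
import numpy as np

def approximate_semantic_descriptors(hist, n_groups, n_iter=50):
    """Group correlated visual words and pool their counts (illustrative only).

    hist     : (n_videos, n_words) bag-of-visual-words histogram.
    n_groups : number of word groups to form (hypothetical parameter).
    """
    hist = np.asarray(hist, dtype=float)

    # Inter-video context: Pearson correlation between word columns.
    corr = np.nan_to_num(np.corrcoef(hist, rowvar=False))

    # Assumption: only positive correlation counts as semantic affinity.
    W = np.clip(corr, 0.0, None)
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized graph Laplacian.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Spectral embedding: eigenvectors of the smallest eigenvalues.
    _, vecs = np.linalg.eigh(L)
    emb = vecs[:, :n_groups]
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)

    # Deterministic farthest-point initialisation, then plain k-means.
    centers = [emb[0]]
    for _ in range(1, n_groups):
        dist = np.min([((emb - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(emb[dist.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for k in range(n_groups):
            if np.any(labels == k):
                centers[k] = emb[labels == k].mean(axis=0)

    # Pool the counts of all words in a group into one descriptor bin.
    pooled = np.stack([hist[:, labels == k].sum(axis=1)
                       for k in range(n_groups)], axis=1)
    return labels, pooled
```

Calling `approximate_semantic_descriptors(hist, n_groups)` on a video-by-word count matrix returns one group label per visual word and a reduced histogram per video, which could then be passed to any classifier. The choice of `n_groups` is left to the user in this sketch.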

Research Area(s)

  • Contextual spectral embedding, Pearson product moment correlation, Visual vocabulary