Human action recognition from videos and 3D motion capture data
Student thesis: Doctoral Thesis
Human action recognition is a relatively new research subject in visual understanding. It is an interdisciplinary field at the intersection of image/video processing, computer vision, pattern recognition, statistical learning, and artificial intelligence. It analyzes the content of raw data (videos, images), extracts cues that are discriminative with respect to human actions, and then establishes the relationship between the raw data and high-level semantics. The subject gives rise to many applications, including intelligent video surveillance, human-computer interaction, and virtual reality. Owing to its wide application prospects and high theoretical significance, a great deal of research has recently been done on it. In this dissertation, we investigate the design of automatic human action recognition algorithms for videos and 3D motion capture data. From a review of existing work, we identify the two most important issues in designing human action recognition algorithms: efficient feature representation of human actions and efficient classification of human actions. To tackle these issues, the dissertation proposes new methods and insights from two perspectives: feature learning and classification algorithms. The main contributions of this dissertation are as follows. 1. Design of an approximate-semantic visual vocabulary learning framework, called the contextual spectral embedding (CSE) framework. The approximate-semantic visual vocabulary not only represents the content of the video data efficiently and compactly but also facilitates subsequent high-level semantic recognition. The CSE framework starts from the traditional construction of a redundant visual vocabulary via a clustering algorithm, and then optimizes the whole vocabulary by analyzing the semantic relationships among its visual words.
First, the pairwise semantic similarity is estimated from semantic context via a non-parametric measure, and an undirected graph is then adopted to model the semantic relationships within the visual vocabulary. Finally, spectral graph clustering is used to group semantically similar visual words into one approximate-semantic visual word. This feature learning approach can easily be applied to many applications that rely heavily on visual vocabularies, such as human action recognition and high-level semantic detection. Experiments on four standard datasets demonstrate that our approach achieves significantly improved results with respect to the state of the art. 2. Design of a cross-view human action recognition framework based on transfer learning. To overcome the view dependence of the visual word representation of human actions and improve its robustness to changes in viewpoint, we present a novel framework for robust Bilingual Visual Word learning with multi-source constraint propagation (BiVWL+MSCP). The initial semantic similarity between cross-view visual words is estimated from their co-occurrence information and then refined through multi-source constraint propagation. Finally, view-independent feature learning is also cast as a graph problem, namely bipartite graph partitioning, in which the vertices denote the visual words and the edges denote the semantic relationships between cross-view visual words. Experiments on a multi-view action dataset demonstrate that our approach achieves satisfactory results under changes in viewpoint. 3. Design of a novel classification method based on a spatial-temporal hidden Markov model, so as to better capture the spatial and temporal dependencies within 3D motion capture data simultaneously.
The high dimensionality of 3D motion capture data not only increases computational complexity but also makes the characteristic features of an action hard to identify and detect. We therefore extend the one-dimensional HMM into the spatial-temporal domain. The spatial-temporal hidden Markov model exploits the spatial dependency between each pair of spatially connected joints in the articulated skeletal structure, as well as the temporal dependency due to the continuous movement of each joint. Experiments on a motion capture dataset demonstrate that our classification method performs well in recognizing a common set of basic human actions.
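As an illustration of the vocabulary-merging idea in contribution 1 (pairwise similarity, undirected graph, spectral clustering), the following is a minimal sketch and not the dissertation's implementation. It assumes that each visual word comes with a context vector (for example, co-occurrence counts) and uses cosine similarity as a stand-in for the unspecified non-parametric similarity measure. All function names are hypothetical, and the k-means step is a simplified deterministic variant written only for this example.

```python
import numpy as np

def cosine_affinity(context):
    """Undirected graph weights: cosine similarity between context vectors.
    (Assumption: stands in for the thesis's non-parametric measure.)"""
    norms = np.linalg.norm(context, axis=1, keepdims=True)
    unit = context / np.maximum(norms, 1e-12)
    sim = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(sim, 0.0)  # no self-loops in the word graph
    return sim

def spectral_embedding(affinity, k):
    """Embed graph nodes with the k smallest eigenvectors of the
    symmetric normalized Laplacian L = I - D^(-1/2) W D^(-1/2)."""
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    lap = np.eye(len(affinity)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)  # eigenvalues returned in ascending order
    emb = vecs[:, :k]
    # Row-normalize so that words in the same group map to nearby points.
    return emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)

def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq_dist, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def merge_visual_words(context, k):
    """Group redundant visual words into k merged words; returns one label per word."""
    return kmeans(spectral_embedding(cosine_affinity(context), k), k)

def requantize(histogram, labels, k):
    """Re-express a bag-of-words histogram over the merged vocabulary."""
    merged = np.zeros(k)
    np.add.at(merged, labels, histogram)  # sum counts of words sharing a label
    return merged
```

In use, `merge_visual_words` would be called once on the redundant vocabulary produced by the initial clustering, and `requantize` would then map each video's word histogram onto the smaller merged vocabulary before classification. The number of merged words `k` is a free parameter in this sketch.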
- Motion, Digital video, Optical pattern recognition, Computer simulation, Three-dimensional imaging