Learning of Spatial and Temporal Representations for Human Motion Understanding


Student thesis: Doctoral Thesis



Award date: 6 May 2021


Human motion modeling has attracted increasing attention in many real-world applications. The goal of motion understanding is to learn representative features from the abundant spatial and temporal information in human movements. Among the different types of human activity data, skeleton-based joint representations have become increasingly popular because of their robustness to appearance and background interference. In this thesis, we propose to learn effective spatial-temporal structures from human skeletal data for several motion analysis tasks, including retrieval, recognition, and prediction.

We first address the motion retrieval problem by investigating low-level hand-crafted features from both the spatial and temporal domains and jointly learning a high-level representation that characterizes the activity of a single character. In particular, we propose to extract latent topic distributions from multi-scale motion descriptors, which results in a content-based motion document representation for similarity matching. Previous models typically represent actions with a single type of geometric feature, which fails to describe the diversity of spatial correlations. In contrast, we consider both the relationships between skeletal joints and the correlations between body parts for a more comprehensive representation of human geometry. Our motion document representation also retains temporal information by allocating a specific word range to the learned geometric features in each temporal segment; topic modeling is then applied so that the representative spatial-temporal descriptors of the motion are preserved as far as possible. Experiments on large-scale motion datasets show the effectiveness and robustness of the proposed motion retrieval method compared with existing models.
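As a rough illustration of the "word range per temporal segment" idea, the sketch below maps quantized per-frame descriptors to word IDs that are offset by segment index, so that the same geometric codeword yields a different word in a different part of the motion. The function name, the equal-length segmentation, and the offset scheme are assumptions made for this example and are not taken from the thesis; the topic-modeling step is omitted.

```python
from collections import Counter

def motion_to_document(codes, num_segments, vocab_size):
    """Build a bag-of-words 'motion document' (hypothetical helper).

    codes: per-frame codeword ids, each in [0, vocab_size).
    Assumption: segment s owns the word range
    [s * vocab_size, (s + 1) * vocab_size).
    """
    n = len(codes)
    doc = Counter()
    for t, code in enumerate(codes):
        # Equal-length temporal segments (an assumption of this sketch).
        seg = min(t * num_segments // n, num_segments - 1)
        doc[seg * vocab_size + code] += 1
    return doc

# Two motions with the same codewords in opposite temporal order
# produce different documents because of the per-segment offsets.
forward = motion_to_document([0, 0, 1, 1], num_segments=2, vocab_size=2)
backward = motion_to_document([1, 1, 0, 0], num_segments=2, vocab_size=2)
```

Without the segment offset, both motions above would collapse to the same histogram; with it, temporal order survives in the bag-of-words form that a topic model consumes.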

Beyond hand-crafted motion features, we also explore deep learning-based approaches that capture short- and long-term spatial-temporal motion correlations for motion prediction. More specifically, we regard the human skeleton as a graph and represent the movements of joint nodes with graph convolutional networks (GCNs). We first explore the feasibility of GCNs for anticipating future motions. Unlike motion retrieval, predicting human dynamics is more complex because it requires understanding the deep coherence between past and future motion, an area where traditional machine learning algorithms are limited. The difficulty of motion prediction lies in spatial ambiguity (i.e., biased relative joint positions in the estimated pose) and temporal drift (i.e., deviation onto a drifting trajectory as errors accumulate over time in the predicted movements). We address these problems by modeling quadruple convolutions on the skeletal graph. Spatially, we first conduct two-way diffusion convolutions along an adaptive graph with flexible joint connectivity. The graph topology is described over joints that are one step and several steps away, capturing both short- and long-term spatial dependencies. We then design a bidirectional recurrent predictor to concurrently improve the short- and long-term temporal movements in an adversarial manner. Extensive experiments on both 3D and 2D datasets show that the proposed prediction method outperforms state-of-the-art methods. The results also show that our method correctly predicts both high-dynamic and low-dynamic motion trends with less motion drift.
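To make the two-way diffusion idea concrete, the following minimal sketch propagates joint features along a directed skeleton graph in both the forward (along edges) and backward (against edges) directions for several hops. It is a generic diffusion-convolution illustration, not the thesis model: the graph here is fixed rather than adaptive, the per-hop weights `theta_fwd` and `theta_bwd` are plain scalars instead of learned matrices, and all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def row_normalize(a):
    """Turn an adjacency matrix into a random-walk transition matrix."""
    out = []
    for row in a:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out

def two_way_diffusion(adj, feats, theta_fwd, theta_bwd):
    """Sum K hops of forward and backward diffusion (K = len(theta_fwd))."""
    p_fwd = row_normalize(adj)             # walk along edge direction
    p_bwd = row_normalize(transpose(adj))  # walk against edge direction
    out = [[0.0] * len(feats[0]) for _ in feats]
    cur_f, cur_b = feats, feats
    for t_f, t_b in zip(theta_fwd, theta_bwd):
        cur_f = matmul(p_fwd, cur_f)       # one more hop forward
        cur_b = matmul(p_bwd, cur_b)       # one more hop backward
        for i in range(len(out)):
            for j in range(len(out[0])):
                out[i][j] += t_f * cur_f[i][j] + t_b * cur_b[i][j]
    return out

# Toy 3-joint chain 0 -> 1 -> 2 with a feature only on joint 0.
adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
feats = [[1.0], [0.0], [0.0]]
result = two_way_diffusion(adj, feats, [1.0, 1.0], [1.0, 1.0])
```

In this toy chain, joint 1 receives the feature after one hop and joint 2 after two hops, which is the sense in which multi-step diffusion covers both nearby and distant joints.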

Lastly, we explore graph convolution algorithms for the more challenging problem of interaction recognition. Whereas motion prediction preserves action-specific representations, action recognition relies on extracting robust features that discriminate between classes. When more than one person appears in the scene, it is also necessary to detect the inter-correlations between the characters' actions, which provide important evidence for crowd behavior analysis. However, existing methods handle human interaction by stacking the movement features of the two characters while neglecting the abundant relationships between them. We propose a novel two-stream framework that adopts geometric features from both single actions and interactions to describe spatial correlations with different discriminative abilities. To understand the relationships between characters, we incorporate pairwise geometric features as auxiliary information to build the graph topology, which indicates the joint correlations of the interaction. After spatial modeling, each stream is fed to a bidirectional recurrent network to encode two-way temporal properties. To take advantage of the diverse discriminative power of the two streams, we also propose a late fusion algorithm that combines their output predictions according to information entropy. Experimental results demonstrate that the proposed model achieves state-of-the-art recognition performance on benchmark human interaction datasets.
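One plausible way to fuse two streams by information entropy is sketched below: each stream's class-probability vector is weighted by how confident it is, measured as one minus its normalized Shannon entropy. This weighting rule and the function names are assumptions chosen for illustration; the abstract does not specify the exact fusion formula used in the thesis.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weighted_fusion(probs_a, probs_b):
    """Fuse two class distributions; lower entropy gets more weight."""
    max_h = math.log(len(probs_a))  # entropy of the uniform distribution
    w_a = max(0.0, 1.0 - entropy(probs_a) / max_h)
    w_b = max(0.0, 1.0 - entropy(probs_b) / max_h)
    total = w_a + w_b
    if total <= 1e-12:
        # Both streams are (near-)uniform: fall back to a plain average.
        w_a = w_b = 0.5
    else:
        w_a, w_b = w_a / total, w_b / total
    return [w_a * a + w_b * b for a, b in zip(probs_a, probs_b)]

# A confident stream and an uncertain stream that disagree on the top class.
confident = [0.90, 0.05, 0.05]
uncertain = [0.30, 0.40, 0.30]
fused = entropy_weighted_fusion(confident, uncertain)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Because the weights are normalized to sum to one, the fused vector remains a valid probability distribution, and the more confident stream dominates when the two streams disagree.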

In summary, we investigate effective spatial and temporal representations for various skeletal motion applications. By modeling human motion with document-based and graph-based structures, the learned spatial and temporal correlations provide a more comprehensive understanding of complex motion dynamics.