Unsupervised Approaches for 3D Human Pose Tracking in Multi-view RGB and Single-view Depth Videos


Student thesis: Doctoral Thesis



  • Weichen ZHANG



Award date: 5 Nov 2015


3D human pose tracking and estimation has been one of the most active and inspiring research areas in computer vision during the last two decades. Whether in multi-view RGB videos or single-view depth videos, 3D human pose tracking remains a challenging problem. First, the pose configuration lives in a high-dimensional state space, and joint positions are mapped nonlinearly from the pose parameters; recovering the pose is therefore a nonlinear high-dimensional optimization problem. Second, many widely used image features, including edges, silhouettes, HOG descriptors and point clouds, are not robust to self-occlusions during human pose tracking, making overlapping limbs hard to localize with these features.
To deal with these problems, three types of approaches have been introduced for human pose tracking and estimation. The first are unsupervised approaches, also called generative approaches. Unsupervised methods require a human body model and try to match this model to the observation as well as possible; the pose parameters of the best-matched body model are taken as the tracking result. These methods use various strategies (e.g., the Annealing Particle Filter or Markov Chain Monte Carlo) to optimize a nonlinear high-dimensional cost function, or use a twist and exponential-map pose representation to linearize the cost function. The second type consists of supervised approaches (discriminative approaches), which train regression models to map image features directly to pose vectors. No human body model is needed for supervised methods, since predictions are made directly from the image features, and poses with self-occlusions can also be estimated if similar poses were seen during training. Unsupervised methods work for all types of motions and scenes, but their cost functions are hard to optimize; inference with supervised methods is easy once the model is learned, but only works for motions and scenes similar to the training samples. To combine the advantages of both, hybrid approaches were proposed, in which candidate poses are predicted by supervised models and then refined by unsupervised optimization.
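The generative optimization strategy described above can be sketched generically. The following is a minimal toy illustration of an annealed particle filter over a pose vector, not the thesis implementation: the cost function, particle count, annealing schedule and diffusion noise are all assumptions for demonstration.

```python
import numpy as np

def annealed_particle_filter(cost, x0, n_particles=200, n_layers=5,
                             noise=0.5, seed=0):
    """Toy annealing-particle-filter sweep over a pose vector.

    cost maps a pose vector to a non-negative matching error (lower is
    better). Annealing sharpens the weighting exp(-beta * cost) layer by
    layer, so particles first explore broadly and then concentrate in
    deep minima of the cost, which helps in high-dimensional pose spaces.
    """
    rng = np.random.default_rng(seed)
    d = len(x0)
    particles = x0 + noise * rng.standard_normal((n_particles, d))
    for layer in range(n_layers):
        beta = (layer + 1) / n_layers                 # soft -> sharp weighting
        errors = np.array([cost(p) for p in particles])
        w = np.exp(-beta * (errors - errors.min()))   # annealed weights
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)  # resample
        spread = noise * (1.0 - layer / n_layers)     # shrink diffusion per layer
        particles = particles[idx] + spread * rng.standard_normal((n_particles, d))
    errors = np.array([cost(p) for p in particles])
    return particles[np.argmin(errors)]
```

With a simple quadratic cost centered away from the starting pose, the returned particle ends up far closer to the minimum than the initial guess; in real tracking the cost would compare projected body-model features against image observations.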
In this thesis, we focus on unsupervised approaches, since they do not require a learning process and hence work for any type of complex motion. For multi-view videos, we first consider how to improve the likelihood function of unsupervised methods to make them more robust to noise and self-occlusions. We extract edge, silhouette and color features and combine them using a part-based model to help localize partially overlapped limbs. To make optimization with the Annealing Particle Filter (APF) smoother, we apply an exponential transformation to the likelihood function. With the combination of the exponential transformation and the part-based model, we achieve state-of-the-art performance among unsupervised approaches. For single-view depth videos, we consider how to avoid converging to local minima of the ICP cost function caused by noise and self-occlusions. We propose a multiple-hypothesis Iterative Closest Point (MHICP) algorithm to localize the limbs, which uses bi-directional matching and a two-way updating strategy to search for as many of the possible solutions as it can. Since we obtain multiple solutions at each frame, we select some of them with a global score function, both as the result for the current frame and as the initial pose set for the next frame. We achieve performance on par with state-of-the-art unsupervised methods for single-view depth videos.
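For context, the plain rigid ICP baseline that MHICP builds on can be sketched as follows. This is a standard point-to-point ICP with Kabsch alignment, not the thesis method: the one-directional greedy nearest-neighbour matching shown here is exactly the step that can drag ICP into local minima under noise and self-occlusion, which is what bi-directional matching and multiple hypotheses aim to mitigate.

```python
import numpy as np

def icp_step(src, dst):
    """One rigid point-to-point ICP iteration (Kabsch alignment).

    Correspondences are one-directional nearest neighbours in dst for
    each src point; with occluded or noisy clouds this greedy matching
    is the usual source of local minima in the ICP cost.
    """
    # nearest neighbour in dst for every src point (brute force)
    d2 = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
    matched = dst[d2.argmin(axis=1)]
    # closed-form rigid transform minimising ||R src + t - matched||^2
    mu_s, mu_m = src.mean(0), matched.mean(0)
    H = (src - mu_s).T @ (matched - mu_m)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # fix reflections
    R = Vt.T @ D @ U.T
    t = mu_m - R @ mu_s
    return src @ R.T + t, R, t

def icp(src, dst, iters=30):
    """Iterate alignment steps until the clouds (hopefully) converge."""
    cur = src
    for _ in range(iters):
        cur, _, _ = icp_step(cur, dst)
    return cur
```

Given a cloud displaced by a small rigid motion, this converges to the correct alignment; under large motions or missing (occluded) points the same loop can settle on a wrong but locally optimal pose, motivating multiple-hypothesis search.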
Existing datasets for 3D human pose tracking and estimation only contain common daily-life movements, e.g. walking, jogging and gesturing. These repetitive movements are simple and easily learned by supervised methods. To further improve the robustness of 3D human pose estimation algorithms, we collect a new and much more challenging dataset: the Martial Arts, Dancing and Sports (MADS) dataset, with both multi-view videos and single-view depth videos. MADS contains actions from Tai-chi, Karate, jazz dance, hip-hop dance and a sports combo (badminton, basketball, football, rugby, tennis and volleyball). These actions have larger ranges of movement, faster movement speeds and more spinning movements that cause more self-occlusions, all of which make MADS challenging for current algorithms. Here we obtain depth from a stereo camera rather than from Kinect or ToF sensors, because stereo cameras can operate outdoors and in other scenes with infrared noise where Kinect and ToF sensors cannot work. Recent supervised, unsupervised and hybrid approaches are tested on the MADS dataset to provide baseline performance.