Human Pose Estimation with Deep Neural Network
基於深層神經網絡的人物姿態估計方法 (Human Pose Estimation Based on Deep Neural Networks)
Student thesis: Doctoral Thesis
Detail(s)
Award date: 11 Jul 2016
Link(s)
Permanent Link: https://scholars.cityu.edu.hk/en/theses/theses(d55975ee-3e7e-4d8f-80cd-ad54b8fd4aec).html
Abstract
Human pose estimation is an active research area in computer vision due to its wide range of potential applications, from motion capture for video games, to action classification for image/video retrieval and surveillance, to human-robot interaction.
The goal of human pose estimation is to locate the 2D/3D coordinates of body parts (joints). One challenge of pose estimation is to learn a good appearance model that is invariant to pose, since body-part appearance and background vary from image to image. Deep architectures have been shown to be effective at extracting rich, high-level image features in many computer vision tasks. In this thesis, we focus on pose estimation from a monocular RGB image with deep neural networks. First, we propose a heterogeneous multi-task learning framework for 2D human pose estimation from monocular images using a deep convolutional neural network. In particular, we simultaneously learn a human pose regressor and sliding-window body-part and joint-point detectors in a single deep network architecture. We show that including the detection tasks helps to regularize the network, directing it to converge to a good solution. We report competitive and state-of-the-art results on several datasets. We also show empirically that the learned neurons in the middle layers of our network are tuned to localized body parts.
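To make the multi-task setup concrete, the following is a minimal PyTorch-style sketch of a shared convolutional trunk feeding both a pose-regression head and a part-detection head, trained under a combined loss. The class name, layer sizes, joint count, and the `alpha` weighting are illustrative assumptions, not the architecture or hyperparameters used in the thesis.

```python
import torch
import torch.nn as nn

class MultiTaskPoseNet(nn.Module):
    """Sketch of a heterogeneous multi-task network: a shared convolutional
    trunk feeds both a 2D pose regressor and per-part detection heads."""
    def __init__(self, num_joints=14):
        super().__init__()
        # Shared feature extractor (layer sizes are illustrative).
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 16, 512), nn.ReLU(),
        )
        # Regression head: (x, y) coordinates for each joint.
        self.regressor = nn.Linear(512, num_joints * 2)
        # Detection head: one binary presence score per body part / joint point.
        self.detector = nn.Linear(512, num_joints)

    def forward(self, x):
        f = self.trunk(x)
        return self.regressor(f), self.detector(f)

def multitask_loss(pred_pose, pred_det, gt_pose, gt_labels, alpha=0.1):
    # Pose regression error plus a detection term; the detection loss acts
    # as the regularizer that shapes the shared trunk.
    reg = nn.functional.mse_loss(pred_pose, gt_pose)
    det = nn.functional.binary_cross_entropy_with_logits(pred_det, gt_labels)
    return reg + alpha * det
```

The key design point is that both heads share one trunk, so gradients from the detection task also update the features used by the regressor.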
In general, recovering 3D pose from 2D RGB images is considered more difficult than 2D pose estimation, due to the larger 3D pose space, greater ambiguity, and the ill-posedness introduced by the irreversible perspective projection. We extend our heterogeneous multi-task learning framework to 3D human pose estimation. We train the network using two strategies: 1) a multi-task framework that jointly trains the pose regressor and body-part detectors; 2) a pre-training strategy in which the pose regressor is initialized from a network trained for body-part detection. We evaluate our network on the large Human3.6M dataset and achieve significant improvements over baseline methods. Because of the dependencies among joint points, the locations of the 3D body parts are highly correlated. Although we do not add explicit constraints on the correlations between body parts to the network, we show empirically that the network has disentangled some of the dependencies among different body parts and learned their correlations.
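The second training strategy (detection pre-training) can be sketched as below, reusing the hypothetical `MultiTaskPoseNet` from the previous example: train a network for body-part detection first, then copy its shared trunk into a fresh pose-regression network before fine-tuning. This is an assumption-laden illustration of the weight-transfer step, not the thesis's exact procedure.

```python
# Hypothetical illustration of the pre-training strategy (strategy 2):
# initialize the pose regressor from a network trained for detection.
detector_net = MultiTaskPoseNet(num_joints=14)
# ... train detector_net here using only the detection loss ...

# For 3D pose, the regression head would output num_joints * 3 values.
regressor_net = MultiTaskPoseNet(num_joints=14)
# Transfer the trunk weights learned for body-part detection.
regressor_net.trunk.load_state_dict(detector_net.trunk.state_dict())
# ... then fine-tune regressor_net with the pose regression loss ...
```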
As the locations of the 3D body parts are highly correlated and constrained, human pose estimation is also a structured-output task. To explicitly take the dependencies among joint points into account, we propose a maximum-margin structured learning framework with deep neural networks for estimating whether a given image and pose match. Specifically, our network takes an image and a 3D pose as inputs and outputs a score, which is high when the image-pose pair matches and low otherwise. The network consists of a convolutional neural network for image feature extraction, followed by two sub-networks that transform the image features and the pose into a joint embedding. The score function is the dot product between the image and pose embeddings. The image-pose embedding and score function are trained jointly using a maximum-margin cost function. Our proposed framework can be interpreted as a special form of structured support vector machine in which the joint feature space is learned discriminatively by deep neural networks. We also propose an efficient recurrent-neural-network-based approach for performing inference with the learned image embedding. We test our framework on the Human3.6M dataset and obtain state-of-the-art results compared with other recent methods. Finally, we present visualizations of the image-pose embedding space, demonstrating that the network has learned a high-level embedding of body orientation and pose configuration.
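As a rough sketch of the scoring network, the following PyTorch-style code embeds an image and a 3D pose into a shared space, scores the pair by a dot product, and trains with a hinge loss. All names (`ImagePoseScoreNet`, `max_margin_loss`) and layer sizes are hypothetical; note also that the margin here is a fixed constant, whereas a structured margin would typically scale with how far the wrong pose is from the true one.

```python
import torch
import torch.nn as nn

class ImagePoseScoreNet(nn.Module):
    """Sketch of the image-pose matching network: a CNN embeds the image,
    an MLP embeds the 3D pose, and the score is their dot product."""
    def __init__(self, num_joints=17, embed_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.image_embed = nn.Linear(32 * 16, embed_dim)
        self.pose_embed = nn.Sequential(
            nn.Linear(num_joints * 3, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, image, pose):
        e_img = self.image_embed(self.cnn(image))
        e_pose = self.pose_embed(pose)
        # High dot-product score when the image and pose match.
        return (e_img * e_pose).sum(dim=1)

def max_margin_loss(net, image, true_pose, wrong_pose, margin=1.0):
    # Structured hinge loss: the true image-pose pair should out-score
    # a mismatched pose by at least the margin.
    s_true = net(image, true_pose)
    s_wrong = net(image, wrong_pose)
    return torch.clamp(margin - s_true + s_wrong, min=0).mean()
```

Because the image embedding is computed once per image, inference reduces to searching over poses against a fixed image vector, which is what makes an efficient inference procedure over the pose space plausible.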