Abstract
In recent years, 3D human motion analysis and 3D human body shape modeling have received increasing attention. The goal of this thesis is to explore various spatial representations of human-related data, including 3D skeleton key-point sequences, 2D RGB videos, and 3D human mesh sequences. Notably, temporal sequences of 3D skeleton key points and 2D RGB video frames are becoming more and more popular, because the correlations and constraints across a temporal series can effectively improve 3D human motion analysis and body modeling results. Accordingly, we use representations with increasingly rich spatial information and propose effective temporal models with appropriate training strategies to improve video-based human modeling tasks, including human motion prediction, single-person mesh recovery, and multi-person free-view synthesis.

We first focus on the problem of motion prediction. Given observed 3D human skeletal sequences from videos, the objective of human motion prediction is to predict plausible and consecutive future human motion, which conveys abundant clues about a person's intention, emotion, and identity. However, predicting plausible future human motion is very challenging because of the non-linear and highly coupled spatio-temporal dependencies among human body parts during movement. For more accurate prediction of future human motion, we propose an Adversarial Refinement Network (ARNet) that follows a simple yet effective coarse-to-fine mechanism with novel adversarial error augmentation. Specifically, we take both the historical motion sequence and a coarse prediction as input to our cascaded refinement network to predict refined human motion, and we strengthen the refinement network with adversarial error augmentation. This augmentation provides rich error cases as input to the refinement network, leading to better generalization performance on the testing dataset.
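As a rough illustration of the coarse-to-fine idea with adversarial error augmentation, the sketch below shows a coarse predictor whose output is perturbed by a learned error generator before being refined. All module names, layer sizes, and the error scale are assumptions for illustration only, not the thesis implementation.

```python
import torch
import torch.nn as nn

class CoarsePredictor(nn.Module):
    """Predicts a coarse future pose from the observed history (illustrative)."""
    def __init__(self, joint_dim=66, hidden=256):
        super().__init__()
        self.gru = nn.GRU(joint_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, joint_dim)

    def forward(self, history):                  # history: (B, T_obs, J*3)
        _, h = self.gru(history)
        return self.out(h[-1]).unsqueeze(1)      # one coarse future frame

class RefinementNet(nn.Module):
    """Refines the coarse prediction conditioned on the observed history."""
    def __init__(self, joint_dim=66, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_dim * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, joint_dim))

    def forward(self, history, coarse):
        # Condition on the last observed pose and the coarse guess; predict a residual.
        x = torch.cat([history[:, -1], coarse[:, 0]], dim=-1)
        return coarse + self.net(x).unsqueeze(1)

# Adversarial error augmentation (sketch): a generator injects plausible error
# patterns into the coarse prediction so the refiner sees rich error cases.
error_gen = nn.Sequential(nn.Linear(66, 128), nn.ReLU(), nn.Linear(128, 66), nn.Tanh())

history = torch.randn(4, 10, 66)
coarse = CoarsePredictor()(history)
augmented = coarse + 0.1 * error_gen(coarse)     # error magnitude is an assumed choice
refined = RefinementNet()(history, augmented)
print(refined.shape)                             # torch.Size([4, 1, 66])
```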
However, the limited spatial information in 3D key-point sequences and the expensive equipment required to collect 3D skeleton data affect the quality and quantity of 3D key-point datasets, respectively, limiting their use in daily life. Consequently, we further study 3D human mesh recovery, which infers statistical body shape and pose parameters from 2D RGB videos. Existing video-based methods lack efficient temporal aggregation and sequence-level motion supervision. To tackle this problem, we propose Video2mesh, a temporal convolutional transformer (TConvTransformer) network that recovers accurate and smooth human meshes from 2D videos. The temporal convolution block achieves sequence-level smoothness by aggregating self-improving image features from adjacent frames. The subsequent multi-attention transformer maintains accuracy by computing attention weights among the aggregated features. An additional adversarial training mechanism further improves accuracy and smoothness by restricting the pose and shape to a more reliable space based on the AMASS dataset.
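The following minimal sketch illustrates the temporal-convolution-plus-transformer aggregation over per-frame image features. Feature dimensions, layer counts, and the parameter head (SMPL pose, shape, and a weak-perspective camera) are assumptions chosen for the example, not the Video2mesh code.

```python
import torch
import torch.nn as nn

class TConvTransformerBlock(nn.Module):
    """Aggregates per-frame features in time, then relates them with self-attention."""
    def __init__(self, feat_dim=2048, model_dim=512, n_heads=8, n_layers=3):
        super().__init__()
        # 1D convolution over the time axis mixes information from adjacent frames.
        self.temporal_conv = nn.Conv1d(feat_dim, model_dim, kernel_size=3, padding=1)
        layer = nn.TransformerEncoderLayer(model_dim, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        # Regress SMPL pose (72) + shape (10) + weak-perspective camera (3) per frame.
        self.head = nn.Linear(model_dim, 72 + 10 + 3)

    def forward(self, frame_feats):               # frame_feats: (B, T, feat_dim)
        x = self.temporal_conv(frame_feats.transpose(1, 2)).transpose(1, 2)
        x = self.transformer(x)                   # attention over the aggregated features
        return self.head(x)                       # (B, T, 85) per-frame parameters

feats = torch.randn(2, 16, 2048)                  # e.g. backbone features for 16 frames
params = TConvTransformerBlock()(feats)
print(params.shape)                               # torch.Size([2, 16, 85])
```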
Furthermore, the Skinned Multi-Person Linear (SMPL) model can only represent the general shape of the unclothed human body, lacking real shape details and colors, and such information is difficult to obtain from single-view RGB images. Therefore, we further study human free-view synthesis based on sparse-view RGB videos and extend the solution to multi-person scenes. Multi-person free-view synthesis aims to generate free-viewpoint videos for dynamic scenes containing multiple persons. However, existing methods require dense views to reconstruct dynamic persons and achieve good performance only on a single person. To reconstruct a multi-person scene from fewer views, and in particular to handle the occlusion and interaction problems that arise in multi-person scenes, we propose MP-NeRF, a method for multi-person novel view synthesis from sparse cameras without a pre-scanned human model template. First, we apply a multi-person SMPL template as the identity and human motion prior. Then, we create a global latent code to integrate the relative observations among the multiple persons in a video, enabling multiple dynamic persons to be represented by multiple neural radiance representations from sparse views. Experiments on the multi-person dataset MVMP show that our method is superior to other state-of-the-art methods.
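As a simplified illustration of this design, the sketch below conditions a per-person radiance field on an SMPL-anchored identity code and a scene-level global latent code. The module name, code dimensions, and MLP structure are assumptions for the example and do not reproduce the MP-NeRF implementation.

```python
import torch
import torch.nn as nn

class PersonNeRF(nn.Module):
    """Per-person radiance field conditioned on identity and global latent codes."""
    def __init__(self, id_dim=128, global_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + id_dim + global_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4))                 # RGB color + density

    def forward(self, pts, id_code, global_code):
        # pts: (N, 3) ray samples expressed relative to the person's SMPL template,
        # so the body prior constrains the geometry learned from sparse views.
        n = pts.shape[0]
        cond = torch.cat([id_code.expand(n, -1), global_code.expand(n, -1)], dim=-1)
        out = self.mlp(torch.cat([pts, cond], dim=-1))
        rgb, sigma = torch.sigmoid(out[:, :3]), torch.relu(out[:, 3])
        return rgb, sigma

pts = torch.rand(1024, 3)
rgb, sigma = PersonNeRF()(pts, torch.randn(1, 128), torch.randn(1, 128))
print(rgb.shape, sigma.shape)                     # torch.Size([1024, 3]) torch.Size([1024])
```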
Overall, we first investigate an effective temporal model and training algorithm for 3D human motion analysis in a sparse key-point representation. Then, we expand the human representation from 3D skeleton data to the unclothed SMPL body model and use motion sequences as weak supervision to achieve accurate and smooth human body shape modeling. Finally, we apply the SMPL model to recover photo-realistic details of human body shape and color via free-view synthesis from sparse-view RGB videos of multiple persons.
| Date of Award | 29 Aug 2023 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Wing Ho Howard LEUNG (Supervisor) |