Exploring and Enhancing Human Pose Transfer for Image and Video Synthesis


Student thesis: Doctoral Thesis


Award date: 19 Jun 2023


Image and video synthesis are computer vision tasks that enable the creation of high-resolution, photo-realistic AI-generated images and videos within a particular task-specific data distribution. They have become a popular research area in recent decades owing to the rise of generative adversarial networks (GANs) and the high demand for digital content in society. This thesis focuses on a human-centric research field, Human Pose Transfer, which aims to transform the posture of a human body into arbitrary target poses while preserving facial identity and garment characteristics. It has tremendous potential for multimedia applications such as virtual movie rendering, advertisement synthesis, game design, and data augmentation for industrial projects. To equip the rendering process with conditionally controllable capability, the parsing map, a semantic segmentation mask generated by human parsers, is utilized to classify the body parts. Four major sections introduce the parsing map and the evolution of human pose transfer: human parsing, image-based human pose transfer, controllable image-based human pose transfer, and video-based human pose transfer.

Firstly, human parsing is presented, which addresses the problem of assigning a label to each human body part, such as the face, right arm, or pants. Previous methods made use of semantic context information to extract semantic features at multiple scales. However, their capability to extract features on occluded and small-scale objects, such as scarves, socks, or gloves, needs improvement. To alleviate these issues, a novel framework, the Foreground-Edge-Aware Network (FEANet), is proposed. It fuses foreground and edge information to segment occluded regions by reducing the impact of non-human object parts and preserving the boundaries between classes. Moreover, a Dense Atrous Spatial Pyramid Object Context (DenseASPOC) module is introduced to address the difficulty of multi-scale objects by enhancing spatial perception and semantic context.
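The core idea behind atrous spatial pyramid modules such as DenseASPOC is to capture context at several scales by convolving with the same kernel at different dilation rates. The sketch below is an illustrative numpy simplification of that idea, not the thesis implementation; the function names and the choice of rates are assumptions for demonstration.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation):
    """Valid-mode 2D convolution with a dilated (atrous) kernel.

    x: (H, W) feature map; kernel: (k, k) weights.
    Dilation inserts gaps between kernel taps, enlarging the
    receptive field without adding parameters.
    """
    k = kernel.shape[0]
    eff = dilation * (k - 1) + 1          # effective kernel size
    H, W = x.shape
    out = np.zeros((H - eff + 1, W - eff + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + eff:dilation, j:j + eff:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

def multi_scale_context(x, kernel, rates=(1, 2, 4)):
    """Collect responses from several dilation rates, in the spirit
    of an ASPP-style multi-scale context module (illustrative only)."""
    maps = []
    for r in rates:
        pad = r * (kernel.shape[0] - 1) // 2  # keep spatial size
        xp = np.pad(x, pad, mode="edge")
        maps.append(dilated_conv2d(xp, kernel, r))
    return np.stack(maps, axis=0)  # (len(rates), H, W)
```

Each dilation rate sees a wider neighbourhood of the same feature map, which is what helps a parser handle both small objects (low rates) and large regions (high rates).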

Secondly, image-based human pose transfer is introduced. Owing to semantic content misalignment and unreliable geometric matching, conventional pose transfer algorithms cannot produce high-fidelity person images. A new GAN-based framework, the Spatial Content Alignment GAN (SCA-GAN), is proposed to generalize the input source as both style codes and content information in order to solve the spatial misalignment problem. To address insufficient structural content information, the edge map is leveraged as an extra constraint to strengthen high-frequency generation.
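One simple way to turn an edge map into a training constraint is to extract edges from both the generated and the target image and penalize their L1 distance. The snippet below is a minimal numpy sketch of that idea using Sobel gradients; SCA-GAN's actual edge constraint may differ, and the function names here are illustrative.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def _conv2d_same(x, k):
    """3x3 'same' convolution with edge padding."""
    xp = np.pad(x, 1, mode="edge")
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * k)
    return out

def edge_map(img):
    """Gradient-magnitude edge map of a grayscale image in [0, 1]."""
    gx = _conv2d_same(img, SOBEL_X)
    gy = _conv2d_same(img, SOBEL_Y)
    return np.hypot(gx, gy)

def edge_loss(generated, target):
    """L1 distance between edge maps: penalizes the generator for
    missing high-frequency structure such as garment contours."""
    return np.mean(np.abs(edge_map(generated) - edge_map(target)))
```

In a GAN setting, such a term would typically be added to the generator objective with a weighting coefficient alongside the adversarial and reconstruction losses.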

Thirdly, image-based human pose transfer with controllable attributes is proposed. Owing to the lack of deformation of the human body shape and garment texture, existing controllable pose transfer methods cannot generate photo-realistic images when editing person identity and garment texture. A novel end-to-end framework, the ShaTure network, is proposed to achieve controllable attributes by encoding the segmented human parts instead of the whole image. To enhance texture fidelity, a new image reconstruction block, the ShaTure Block, is designed to decouple human body shape and garment texture in a braiding manner. This shape-and-texture-oriented architecture preserves more details of garments and person characteristics. An Adaptive Style Selector (AdaSS) module is also developed to focus on segmenting multi-scale objects; it enhances feature extraction by calibrating the feature map with channel-wise attention.
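Channel-wise attention of the kind AdaSS relies on is commonly realized as a squeeze-and-excitation pattern: pool each channel to a scalar, pass the vector through a small bottleneck MLP, and gate the channels with the resulting sigmoid weights. The following numpy sketch shows that generic pattern under assumed shapes; it is not the thesis code, and `w1`/`w2` stand in for learned weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w1, w2):
    """Squeeze-and-excitation-style channel recalibration (sketch).

    feat: (C, H, W) feature map.
    w1: (C // r, C) and w2: (C, C // r) -- bottleneck MLP weights
    with reduction ratio r (learned in a real network).
    Returns the feature map scaled by per-channel gates in (0, 1).
    """
    squeeze = feat.mean(axis=(1, 2))        # (C,) global average pool
    hidden = np.maximum(w1 @ squeeze, 0.0)  # ReLU bottleneck
    gates = sigmoid(w2 @ hidden)            # (C,) channel weights
    return feat * gates[:, None, None]
```

The gating lets the network emphasize channels that respond to the object scale currently present, which is the calibration effect the abstract attributes to AdaSS.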

Last but not least, video-based human pose transfer is presented. It is a video-to-video synthesis task that animates a human image according to a series of target human poses. Given the difficulties of transferring highly structural garment patterns and handling discontinuous poses, existing methods often generate unsatisfactory results such as distorted textures and flickering artifacts. A novel Deformable Motion Modulation (DMM) is presented that utilizes geometric kernel offsets with adaptive weight modulation to perform feature alignment and style transfer simultaneously. To enhance spatio-temporal consistency, bidirectional propagation is leveraged to extract hidden motion information from a warped image sequence generated from noisy poses.
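The building block underlying deformable, modulated operations is sampling each feature location at a learned fractional offset and scaling the sampled value by a modulation weight. The numpy sketch below shows that core operation in its simplest single-channel form; it is a heavy simplification of DMM for illustration, with assumed tensor layouts.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Bilinearly sample map x at fractional location (py, px);
    positions outside the map contribute zero."""
    H, W = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    y1, x1 = y0 + 1, x0 + 1
    wy, wx = py - y0, px - x0
    def at(i, j):
        return x[i, j] if 0 <= i < H and 0 <= j < W else 0.0
    return ((1 - wy) * (1 - wx) * at(y0, x0) + (1 - wy) * wx * at(y0, x1)
            + wy * (1 - wx) * at(y1, x0) + wy * wx * at(y1, x1))

def deformable_modulated_warp(feat, offsets, modulation):
    """Sample each location at a learned offset and scale it by a
    modulation weight -- the core idea behind modulated deformable
    sampling (an illustrative simplification of DMM).

    feat: (H, W); offsets: (H, W, 2) as (dy, dx); modulation: (H, W).
    """
    H, W = feat.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            dy, dx = offsets[i, j]
            out[i, j] = modulation[i, j] * bilinear_sample(feat, i + dy, j + dx)
    return out
```

Because the offsets are continuous and the modulation weights are adaptive, the same operation can align features to a new pose and reweight their contribution, which is why the abstract describes alignment and style transfer being performed simultaneously.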

Research areas

  • Human pose transfer, Image synthesis, Video synthesis, Image processing