Abstract
Human-related content creation is a fundamental area of research in computer graphics, bridging the gap between virtual and real-world humans. Traditional computer graphics pipelines for human generation are time-consuming, labor-intensive, and require specialized expertise, making them inaccessible to non-experts. While generative models have automated and simplified the creation process, they still face challenges such as reliance on large datasets, low generalizability in specific domains, and limited fine-grained control, which restrict their practical applications.

In this thesis, we explore how to achieve fine-grained control for human-related content generation across artistic, static, and temporal domains. Our contributions are threefold: (1) pose and expression transfer for artistic face generation; (2) disentangled geometry and appearance control for static human image synthesis; and (3) geometry, appearance, and motion control for coherent human video generation.
We first investigate cross-domain controllability by transferring poses and expressions from human face videos to artistic face images. Unlike real human faces, artistic faces (e.g., those in paintings and cartoons) often involve exaggerated shapes and diverse textures. Directly applying existing solutions to artistic faces therefore often fails to preserve the characteristics of the original artwork (e.g., face identity and decorative lines along face contours) due to the domain gap between real and artistic faces. To address these issues, we present ReenactArtFace, which achieves artistic face reenactment by combining a 3D prior model with a generative model. We begin by reconstructing a textured 3D semantic artistic face using a 3D Morphable Model (3DMM) and a 2D parsing map derived from the input artistic image. The rigged 3DMM enables the generation of reenacted parsing-map videos, thereby ensuring geometry preservation. Next, we employ a personalized conditional generative adversarial network (cGAN) to enhance texture preservation and synthesize contour lines, guided by the coarse reenactment results rendered from the 3DMM and trained with a novel contour loss. Both quantitative and qualitative evaluations demonstrate that ReenactArtFace surpasses existing reenactment methods, effectively preserving artistic identity and intricate details.
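To make the role of the contour loss concrete, below is a minimal PyTorch-style sketch of one plausible formulation: an L1 reconstruction term re-weighted along semantic-label boundaries extracted from the parsing map. The `contour_mask` helper and the specific weighting are illustrative assumptions, not the thesis's exact implementation.

```python
import torch.nn.functional as F

def contour_mask(parsing, kernel_size=5):
    # parsing: (B, 1, H, W) integer semantic labels from the 2D parsing map
    pad = kernel_size // 2
    labels = parsing.float()
    dilated = F.max_pool2d(labels, kernel_size, stride=1, padding=pad)
    eroded = -F.max_pool2d(-labels, kernel_size, stride=1, padding=pad)
    # pixels where the local max and min labels disagree lie on a boundary
    return (dilated != eroded).float()

def contour_loss(fake, real, parsing, weight=10.0):
    # weighted L1 that emphasizes pixels along semantic contours,
    # encouraging the cGAN to preserve decorative lines (assumed form)
    mask = contour_mask(parsing)
    per_pixel = (fake - real).abs()
    return (per_pixel * (1.0 + weight * mask)).mean()
```

Up-weighting boundary pixels is a common way to keep thin strokes from being averaged away by a plain L1 term, which is consistent with the goal of preserving decorative lines along face contours.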
3D parametric human models effectively provide basic geometric controls (e.g., over expressions and poses). However, they offer limited flexibility for local geometric control, especially in more complex scenarios such as full-body dressed humans. We therefore explore a more accessible and intuitive approach to achieving disentangled control over both geometry and appearance for full-body human image generation. Sketching offers intuitive geometry-editing capabilities and has been widely applied in sketch-based face generation and editing. However, directly extending sketch-based methods to full-body generation often fails to produce high-fidelity and diverse results due to the complexity and variability of poses, body shapes, and garment textures. Recent diffusion-based methods with geometric controllability primarily rely on prompts and struggle to balance realism with adherence to input sketches when the inputs are coarse. To address these challenges, we propose Sketch2Human, which combines semantic sketches (for geometry control) and reference images (for appearance control) based on the latent space of the unconditional human generator StyleGAN-Human. StyleGAN-Human not only guarantees high-quality outputs but also offers an interpretable latent space, enabling basic control through inversion encoders. To enhance the robustness and accuracy of geometry inversion, we directly supervise the sketch encoder in the geometry domain rather than in the image domain. Given the entangled nature of geometry and texture in StyleGAN-Human, we design a novel training scheme that generates geometry-preserving, appearance-transferred training data, enabling the generator to achieve fully disentangled control. Despite being trained on synthetic data, Sketch2Human effectively handles hand-drawn sketches, producing high-fidelity and diverse results. Extensive qualitative and quantitative evaluations demonstrate the superiority of our method over state-of-the-art approaches in both realism and controllability.
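As an illustration of geometry-domain supervision, here is a hedged PyTorch-style sketch of one training step: the encoder's latent is decoded by the generator, the result is mapped back into the sketch domain by a differentiable extractor, and the loss is computed between sketches rather than between images. The module names (`encoder`, `generator`, `sketch_extractor`) are assumed placeholders for the thesis's actual components.

```python
import torch.nn.functional as F

def geometry_domain_step(sketch, encoder, generator, sketch_extractor, optimizer):
    # sketch -> W+ latent; the generator (e.g., StyleGAN-Human) stays frozen
    w_plus = encoder(sketch)
    image = generator(w_plus)
    # map the generated image back into the geometry (sketch) domain
    recon_sketch = sketch_extractor(image)
    # supervise in the geometry domain instead of the image domain
    loss = F.l1_loss(recon_sketch, sketch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of this setup is that the encoder is never penalized for appearance it cannot (and should not) predict from a sketch; only the recovered geometry is compared against the input.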
Finally, we extend the above disentangled geometry and appearance control into the temporal domain for human video synthesis. Recent advances in diffusion-based human fashion video generation have transformed the field, producing a variety of promising results. However, existing methods mainly focus on pose control and lack sketch-based control. Drawing insights from Sketch2Human, we attribute this gap to the absence of appearance-consistent yet shape-varying examples in existing human video datasets. Moreover, the requirement of dense sequential structure inputs to control video generation hinders real-world applications. We propose Sketch2HumanVideo, which generates human videos under three conditions: temporally sparse sketches, a spatially sparse pose sequence, and a reference appearance image. Our key contribution is a sparse sketch encoder, which takes the first two conditions as input and enables precise, multi-view control of shape motion. To provide the missing knowledge noted above, we leverage the expertise of two pretrained models to synthesize a training dataset of shape-varying yet appearance-consistent examples. Furthermore, we introduce an enlarging-and-resampling scheme that enhances high-frequency details in local regions under resource-constrained settings, thereby promoting the generation of realistic videos. Qualitative and quantitative experiments show that our method offers flexible control and outperforms state-of-the-art approaches.
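To illustrate the intent of the enlarging-and-resampling scheme, the following PyTorch-style sketch crops a local region (e.g., the face), enlarges it, re-runs a diffusion sampler on the enlarged crop, and resamples the refined result back into place. The `denoise_fn` callable and the simple paste-back are illustrative assumptions rather than the thesis's exact procedure.

```python
import torch.nn.functional as F

def enlarge_and_resample(frames, box, denoise_fn, scale=2):
    # frames: (N, C, H, W) decoded video frames; box: (x0, y0, x1, y1)
    x0, y0, x1, y1 = box
    crop = frames[:, :, y0:y1, x0:x1]
    # enlarge the local region so the sampler sees it at a higher resolution
    big = F.interpolate(crop, scale_factor=scale,
                        mode="bilinear", align_corners=False)
    refined = denoise_fn(big)  # re-run the pretrained diffusion sampler
    # resample back to the original crop size and paste into the frames
    small = F.interpolate(refined, size=crop.shape[-2:],
                          mode="bilinear", align_corners=False)
    out = frames.clone()
    out[:, :, y0:y1, x0:x1] = small
    return out
```

Refining only a small crop keeps memory and compute bounded, which matches the stated goal of recovering high-frequency local detail in resource-constrained scenarios.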
| Date of Award | 28 Aug 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Miu Ling LAM (Supervisor) & Hongbo Fu (External Co-Supervisor) |