Sketch-based 2D & 3D Content Generation with Generative Models

利用生成模型基於草圖的二維和三維內容生成

Student thesis: Doctoral Thesis

Award date: 31 May 2024

Abstract

Freehand sketching provides an intuitive and versatile way to create diverse 2D and 3D content. Deep generative models, e.g., Generative Adversarial Networks (GANs) and Latent Diffusion Models (LDMs), are powerful tools for sketch-based 2D and 3D content generation. Such learning-based methods often require large-scale sketch datasets for training. Since it is difficult to collect or construct such datasets of human-drawn sketches, synthetic sketches are often adopted instead. However, there is a domain gap between synthetic and real sketches, which significantly limits how well models trained on synthetic sketches generalize to human-drawn ones. In this thesis, we explore how to narrow this gap for sketch-based 2D and 3D content generation, and make the following contributions: the collection and analysis of freehand sketches for improving sketch-based 3D model reconstruction, a GAN-based method for sketch-based hair image synthesis enabled by a dedicated dataset of hair image-sketch pairs, and a diffusion-based method for few-shot sketch-based image synthesis and editing.

We first analyze the gap between synthetic and human-drawn sketches in sketch-based 3D model reconstruction. To this end, we collected a moderately large-scale dataset of sketches drawn by humans. Specifically, we invited 70 novice users and 38 expert users to sketch 136 representative 3D objects, which were presented as 362 images rendered from multiple views. This leads to a new dataset of 3,620 freehand multi-view sketches, which are registered with their corresponding 3D objects under certain views. Our dataset is an order of magnitude larger than the existing datasets. We analyze the collected data at three levels, i.e., sketch-level, stroke-level, and pixel-level, in terms of both spatial and temporal characteristics, to compare the human-drawn sketches with the synthetic sketches auto-extracted from the corresponding 3D models. We found that sketches by experts are more easily approximated by synthetic drawings than sketches by novices. Thus, we further investigate how differently people with and without adequate drawing skills sketch 3D objects. The drawings by professionals and novices show significant differences at the stroke level, both intrinsically and extrinsically. We demonstrate the usefulness of our dataset in two applications: (i) freehand-style sketch synthesis and (ii) serving as a potential benchmark for sketch-based 3D reconstruction.

We further study how to mitigate the domain gap between synthetic and real sketches, in terms of geometry and appearance, for the task of sketch-based hair image synthesis. Prior methods automatically compute orientation maps from hair images and then extract hair strokes from the orientation maps. However, the extracted strokes might not faithfully respect the underlying structures of hair images and cannot represent appearance information. We observe that colored human-drawn hair sketches can implicitly define target hair shapes as well as hair appearance, and are more flexible in depicting hair structures than those synthesized from orientation maps. Based on these observations, we present SketchHairSalon, a two-stage GAN-based framework for generating realistic hair images directly from freehand sketches depicting the desired hair structure and appearance. In the first stage, we train a GAN to predict a hair matte from an input hair sketch with an optional set of non-hair strokes. In the second stage, another GAN is trained to synthesize the structure and appearance of hair images from the input sketch and the generated matte. To make the networks in the two stages aware of the long-term dependencies of strokes, we apply self-attention modules to them. To train these networks, we present a new dataset containing thousands of annotated hair sketch-image pairs and corresponding hair mattes. Two efficient methods for sketch completion are proposed to automatically complete repetitive braided parts and hair strokes, respectively, thus reducing the workload of users. Based on the trained networks and the two sketch completion strategies, we build an intuitive interface that allows even novice users to design visually pleasing hair images exhibiting various hair structures and appearances via freehand sketches. Qualitative and quantitative evaluations show the advantages of the proposed system over existing or alternative solutions.
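
The data flow of the two-stage design described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`predict_matte`, `synthesize_hair`, `sketch_to_hair`) are hypothetical, and the bodies are simple placeholders standing in for the trained GAN generators, chosen only so that the inputs and outputs of each stage are explicit.

```python
import numpy as np

def predict_matte(sketch, non_hair_strokes=None):
    """Stage 1 stand-in: map a colored sketch (H, W, 3) to a hair matte (H, W).

    A trained generator would go here (assumption). This placeholder marks
    pixels covered by hair strokes and clears pixels under non-hair strokes.
    """
    matte = (sketch.sum(axis=-1) > 0).astype(np.float32)
    if non_hair_strokes is not None:
        matte[non_hair_strokes > 0] = 0.0
    return matte

def synthesize_hair(sketch, matte):
    """Stage 2 stand-in: produce an image (H, W, 3) from the sketch and matte.

    A trained generator would go here (assumption). This placeholder keeps
    the sketch colors inside the matte region only.
    """
    return sketch.astype(np.float32) * matte[..., None]

def sketch_to_hair(sketch, non_hair_strokes=None):
    """Chain the two stages: sketch -> matte -> hair image."""
    matte = predict_matte(sketch, non_hair_strokes)
    image = synthesize_hair(sketch, matte)
    return matte, image

# Tiny usage example on a 4x4 canvas with two colored stroke pixels,
# one of which is overridden by a non-hair stroke.
sketch = np.zeros((4, 4, 3))
sketch[1, 1] = [0.5, 0.2, 0.1]
sketch[2, 2] = [0.5, 0.2, 0.1]
non_hair = np.zeros((4, 4))
non_hair[2, 2] = 1
matte, image = sketch_to_hair(sketch, non_hair)
```

The point of the split is that the second stage receives both the sketch and the matte, so the region to be filled with hair is fixed before structure and appearance are synthesized.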

The above two works each require a large-scale, task-specific dataset and a dedicated method for certain object categories. However, collecting high-quality human-drawn sketch datasets is laborious and costly. To address this problem, we study a universal method that enables users to customize their desired sketches using a tiny dataset (1-6 sketch-image pairs) for image synthesis and editing. Personalization techniques for large text-to-image (T2I) models allow users to incorporate new concepts from reference images. This motivates us to explore a novel task of sketch concept extraction: given one or more sketch-image pairs, we aim to extract a special sketch concept that bridges the correspondence between the images and sketches, thus enabling sketch-based image synthesis and editing at a fine-grained level. To accomplish this, we introduce CustomSketching, a two-stage framework for extracting novel sketch concepts. Considering that an object can often be depicted by a contour for its general shape and additional strokes for internal details, we introduce a dual-sketch representation to reduce the inherent ambiguity in sketch depiction. We employ a shape loss and a regularization loss to balance fidelity and editability during optimization. Through extensive experiments, a user study, and several applications, we show that our method is effective and superior to the adapted baselines.
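
One plausible way to write the balance between fidelity and editability is as a weighted sum. The form below is an assumption for illustration only: the abstract names a shape loss and a regularization loss but does not give their definitions, the base term $\mathcal{L}_{\text{base}}$ (for example, a denoising objective of the underlying T2I model) is not stated in the abstract, and the weights $\lambda_{\text{shape}}$ and $\lambda_{\text{reg}}$ are hypothetical.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{base}}
  + \lambda_{\text{shape}} \, \mathcal{L}_{\text{shape}}
  + \lambda_{\text{reg}} \, \mathcal{L}_{\text{reg}}
```

Under this reading, a larger $\lambda_{\text{shape}}$ would favor fidelity to the input sketch, while a larger $\lambda_{\text{reg}}$ would favor editability.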