Abstract
The emergence of deep generative models has revolutionized the creation of visual content, yet a fundamental challenge persists: embedding intuitive, user-centric control into the generation process. While state-of-the-art models can produce visually stunning results, they lack the controllability required to translate specific creative intents into the generated content. This gap is particularly evident in tasks that demand high precision across different modalities, such as reconciling expressive linguistic commands with pixel-accurate image edits, generating diverse yet thematically unified collections of 3D assets, and resolving the inherent geometric ambiguity when creating 3D models from text or image inputs. This thesis addresses these challenges by developing a series of novel multi-modal generative frameworks designed to empower users with fine-grained control over both 2D and 3D visual content generation.

First, we introduce LangRecol, a language-based interactive framework that tackles the expressiveness-usability dilemma of photo color adjustment in graphic designs. The core motivation lies in the need for precise, context-aware color adjustments (e.g., “recolor the bag to the logo color”) that general editing tools cannot provide. LangRecol uses language instructions as a bridge between design elements and photo regions, addressing three key challenges: color accuracy, multi-granularity instruction parsing, and edit locality. Its two-stage pipeline first employs a granularity-aware module to identify the precise source colors from the design elements referenced in the text, followed by a semantic-palette-based recoloring module that applies the extracted colors to the target photo region, ensuring the edit is both localized and semantically coherent.
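To make the two-stage idea concrete, the following is a minimal, self-contained sketch in the spirit of that pipeline. The function names, the dictionary-lookup instruction parsing, and the mean-color-shift recoloring are illustrative assumptions, not the thesis implementation.

```python
# Illustrative sketch of a language-driven, two-stage recoloring pipeline:
# stage 1 extracts a source color from the design element named in the
# instruction; stage 2 applies it to a masked photo region (edit locality).
import numpy as np

def extract_source_color(instruction: str, design_palettes: dict) -> np.ndarray:
    """Stage 1 (simplified): return the color of the first design element
    mentioned in the instruction, e.g. 'logo' -> its RGB value."""
    for element, rgb in design_palettes.items():
        if element in instruction.lower():
            return np.asarray(rgb, dtype=np.float32)
    raise ValueError("no known design element referenced in the instruction")

def recolor_region(photo: np.ndarray, mask: np.ndarray, target_rgb: np.ndarray) -> np.ndarray:
    """Stage 2 (simplified): shift the mean color of the masked region toward
    the target color while preserving per-pixel variation."""
    out = photo.astype(np.float32)
    region = out[mask]
    out[mask] = np.clip(region - region.mean(axis=0) + target_rgb, 0, 255)
    return out.astype(np.uint8)

# Toy usage: recolor a small "bag" region to the logo color.
photo = np.full((4, 4, 3), 120, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
color = extract_source_color("recolor the bag to the logo color", {"logo": (212, 60, 48)})
edited = recolor_region(photo, mask, color)
```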
Second, we propose ThemeStation, a theme-aware 3D-to-3D generation method for producing diverse 3D assets that share consistent themes (semantics and styles) with the input 3D exemplars, such as a group of ancient buildings or a monster ecosystem. ThemeStation synthesizes customized 3D assets based on a small set of input 3D exemplars with two goals: 1) unity, for generating 3D assets that thematically align with the given exemplars, and 2) diversity, for generating 3D assets with a high degree of variation. To this end, we design a two-stage framework that follows practical 3D modeling workflows: a concept image is drawn first, followed by a reference-informed 3D modeling stage. We propose a novel dual score distillation (DSD) loss to jointly leverage priors from both the input exemplars and the synthesized concept image during 3D modeling.
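The sketch below illustrates the general shape of a dual-prior score-distillation update, assuming one diffusion prior adapted to the input exemplars and one to the concept image. The stand-in modules, weights, and noise schedule are assumptions for illustration and do not reproduce the exact DSD formulation.

```python
# Hedged sketch: blend SDS-style gradients from two diffusion priors
# (exemplar prior and concept-image prior) and inject them into the render.
import torch
import torch.nn as nn

class ToyPrior(nn.Module):
    """Stand-in for a diffusion model's noise predictor eps(x_t, t)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x_t, t):
        return self.net(x_t)

def dual_distillation_gradient(render, exemplar_prior, concept_prior,
                               w_exemplar=1.0, w_concept=1.0):
    """Compute an SDS-style gradient that mixes both priors."""
    t = torch.randint(1, 1000, (1,))
    noise = torch.randn_like(render)
    alpha = 1.0 - t.item() / 1000.0          # crude noise schedule for illustration
    x_t = alpha * render + (1 - alpha) * noise
    with torch.no_grad():
        eps_ex = exemplar_prior(x_t, t)       # prior distilled from the input exemplars
        eps_cp = concept_prior(x_t, t)        # prior tied to the concept image
    # Each term pushes the render toward its prior; the weights trade off
    # thematic unity (exemplars) against concept-image guidance.
    return w_exemplar * (eps_ex - noise) + w_concept * (eps_cp - noise)

render = torch.rand(1, 3, 64, 64, requires_grad=True)
grad = dual_distillation_gradient(render, ToyPrior(), ToyPrior())
render.backward(gradient=grad)               # inject the blended gradient, as in SDS
```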
Finally, we introduce Phidias, a versatile generative method for creating 3D assets from text, image, and 3D conditions via reference-augmented diffusion. By integrating retrieved or user-provided 3D models as geometric guidance, Phidias enhances generation quality, generalization ability, and controllability through three innovations: 1) a meta-ControlNet that dynamically modulates the conditioning strength, 2) dynamic reference routing that mitigates misalignment between the input image and the 3D reference, and 3) self-reference augmentations that enable self-supervised training with a progressive curriculum. Extensive experiments show that Phidias outperforms existing approaches both qualitatively and quantitatively, and that it supports versatile applications beyond image-to-3D, such as text-to-3D, theme-aware 3D-to-3D, interactive 3D generation with coarse guidance, and high-fidelity 3D completion.
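As a rough illustration of how conditioning on a 3D reference might be modulated, the sketch below combines a learned strength predictor with a simple timestep-based routing rule. The module design, its inputs, and the routing rule are assumptions for illustration, not the thesis architecture.

```python
# Hedged sketch: predict a conditioning strength from the denoising timestep
# and an alignment score, and route between coarse and fine reference guidance.
import torch
import torch.nn as nn

class MetaController(nn.Module):
    """Predict a conditioning strength in [0, 1] from the normalized timestep
    and a crude image/reference alignment score (both assumed inputs)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, 16), nn.ReLU(),
                                 nn.Linear(16, 1), nn.Sigmoid())
    def forward(self, t_frac, align_score):
        return self.mlp(torch.stack([t_frac, align_score], dim=-1))

def route_reference(reference, t_frac):
    """Assumed routing rule: coarser reference guidance early in denoising
    (to tolerate misalignment), full-resolution guidance later."""
    if t_frac.item() > 0.5:
        return nn.functional.avg_pool2d(reference, 4)   # coarse guidance
    return reference                                     # fine guidance

controller = MetaController()
reference = torch.rand(1, 3, 64, 64)      # rendering of the retrieved 3D reference
t_frac = torch.tensor([0.8])              # normalized timestep (1 = most noisy)
align = torch.tensor([0.6])               # assumed image-reference alignment score
strength = controller(t_frac, align)      # would scale control residuals downstream
guidance = strength.view(1, 1, 1, 1) * route_reference(reference, t_frac)
```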
| Date of Award | 16 Jul 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Gerhard Petrus HANCKE (Supervisor) & Rynson W H LAU (Co-supervisor) |