Towards Controllable and Efficient Generation of High-Quality Visual Content with Transformers

Project: Research



Visual content generation has both high research value and great practical significance, with applications such as data construction and augmentation for deep learning, media creation for movies and gaming, and interactive assistance for art and design. Generative adversarial networks (GANs) have dominated visual content generation for many years, but recently the transformer, the de facto standard architecture for language tasks, has been on the rise in many computer vision tasks and has shown potential superior to GANs for generation in three respects. First, transformers learn long-range interactions on sequential data, enabling better global consistency. Second, transformers with token density prediction naturally support diverse generation. Third, unlike convolutional networks, transformers require minimal inductive biases in their design and are therefore more generic in processing different types of data.

Nonetheless, transformers still face many challenges before they can replace GANs as the next generation of popular visual generative models. First, the generation quality of transformers has not yet matched that of the latest GAN models. Second, transformers are data-hungry and computationally intensive, making them inefficient for practical use. Third, existing transformer models offer little user control over the generated content, making them less flexible than GAN models.

In this project, we propose to address these limitations and make transformers generic and powerful models for visual content generation. To this end, we design new methods that improve generative transformers in terms of quality, controllability, efficiency, and scalability. 1) For quality, we propose a GAN-enhanced transformer framework that combines the benefits of both GANs and transformers. 2) For controllability, we propose a generic conditional transformer framework that enables control of the generated results with multi-modal conditions, e.g., text, images, and 3D models. 3) For efficiency, we propose a dynamic coarse-to-fine transformer framework that adaptively balances quality and speed according to the computational limits of the device. 4) For scalability, we extend our methods from image generation to a wide range of visual content, including videos, vector graphics, and 3D models.

Our preliminary experiments applying transformers to specific tasks, including image inpainting, image colorization, relighting, and point cloud completion, have already demonstrated the feasibility and advantages of some building blocks of the proposed methodologies. The results of this project will push the frontiers of artificial intelligence (AI) creation and have considerable impact on the research communities of computer graphics and computer vision, as well as industries such as entertainment, design, art, and e-commerce.
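As a toy illustration of the diversity argument above, the sketch below samples visual tokens from a predicted next-token distribution, as an autoregressive generative transformer would. Everything here is an illustrative assumption, not part of the project's method: the four-token codebook, the fixed distribution returned by `next_token_probs`, and the temperature knob are placeholders for a real model's learned, prefix-conditioned predictions.

```python
import random

# Hypothetical next-token distribution over a toy codebook of four
# visual tokens. A real transformer would condition on the prefix;
# this stub returns a fixed distribution purely for illustration.
def next_token_probs(prefix):
    return {"sky": 0.5, "sea": 0.3, "sand": 0.15, "ship": 0.05}

def sample_token(prefix, temperature=1.0, rng=random):
    """Sample the next token. Higher temperature increases diversity;
    as temperature -> 0, sampling collapses toward the argmax token."""
    probs = next_token_probs(prefix)
    tokens = list(probs)
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return rng.choices(tokens, weights=weights, k=1)[0]

def generate(length=8, temperature=1.0, seed=0):
    """Autoregressively draw a sequence of visual tokens."""
    rng = random.Random(seed)
    seq = []
    for _ in range(length):
        seq.append(sample_token(seq, temperature, rng))
    return seq
```

Because each step samples from an explicit probability distribution rather than emitting a single deterministic output, different random seeds yield different plausible sequences, which is the sense in which likelihood-based token prediction supports generation diversity without any extra machinery.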


Project number: 9043354
Grant type: GRF
Effective start/end date: 1/01/23 → …