Text-to-3D Generation and Manipulation with Neural Radiance Field Representation

Project: Research



Creativity is a defining characteristic of human intelligence but remains a major challenge for artificial intelligence (AI). Recent advances in text-to-image generation, which produce images from natural language descriptions with unmatched photorealism and astounding imagination, have demonstrated that AI can be extraordinarily creative. This capability is learned from billions of image-text pairs. Extending it from images to 3D objects and scenes has enormous practical value, but doing so directly would require large-scale labeled 3D data, which does not currently exist. To circumvent this limitation, we propose text-driven 3D generation and manipulation techniques that leverage the deep priors of pre-trained text-image models. Consequently, no 3D data are required, and substantial computational resources are saved by reusing pre-trained models. These powerful 2D models also ensure quality and diversity in 3D generation. The neural radiance field (NeRF) is chosen over meshes or point clouds as the representation of 3D models because of its superior rendering quality and its compatibility with neural networks.
First, we propose a text-driven NeRF stylization method for transforming existing 3D models into artistic styles described by text prompts. Both the color and the geometry encoded in the NeRF model will be stylized so as to minimize the contrastive distance between NeRF renderings and the text prompt, as measured by the Contrastive Language-Image Pre-training (CLIP) model. Next, we propose a text-driven NeRF generation method for constructing open-set 3D models from scratch using a text-to-image diffusion model that has been pre-trained on image data. Multi-view frames generated by the diffusion model constrain the 3D reconstruction of the NeRF, while the 3D geometry of the NeRF enforces 3D consistency in multi-view image generation. With a text-to-video diffusion prior, we extend the method beyond static NeRF generation to dynamic NeRF generation, enabling motion in the generated 3D objects or scenes.
Further, we propose a differentiable method for extracting a textured mesh from a NeRF, because artists may wish to modify generated models but NeRF's implicit neural representation is difficult to edit. With this method, artists can freely edit the geometry and color of NeRF models via mesh proxies. Our previous work on text-driven image editing and NeRF manipulation serves as the foundation for this project. Preliminary experiments have shown several building blocks of the proposed methodologies to be both feasible and effective. The project's research findings will push the frontiers of AI creation and have a significant impact on industries such as gaming, design, film, and the metaverse, which place high demands on 3D modeling and rendering.
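As an illustration of the stylization objective described above, the following minimal sketch computes a loss between rendered views and a text prompt in a shared embedding space. It is not the project's implementation. It assumes that the "contrastive distance" is one minus cosine similarity, and that the embeddings have already been produced by a CLIP-style image encoder and text encoder. The function names `cosine_distance` and `stylization_loss` are hypothetical, and plain Python lists stand in for real embedding vectors.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors.

    Assumption: this stands in for the 'contrastive distance' between an
    image embedding and a text embedding; the exact metric is not specified
    in the project description.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def stylization_loss(view_embeddings, text_embedding):
    """Average distance between each rendered view and the text prompt.

    view_embeddings: one embedding per rendered view (hypothetical stand-ins
    for image-encoder outputs on NeRF renderings).
    text_embedding: embedding of the style prompt (hypothetical stand-in for
    the text-encoder output).
    """
    if not view_embeddings:
        raise ValueError("at least one rendered view is required")
    total = sum(cosine_distance(v, text_embedding) for v in view_embeddings)
    return total / len(view_embeddings)


if __name__ == "__main__":
    # Toy 2-D embeddings: one view aligned with the prompt, one orthogonal.
    views = [[1.0, 0.0], [0.0, 1.0]]
    prompt = [1.0, 0.0]
    print(stylization_loss(views, prompt))
```

In an actual pipeline, a loss of this kind would be back-propagated through a differentiable renderer to update the NeRF's color and geometry parameters, which requires an automatic-differentiation framework rather than plain Python lists.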


Project number: 9043528
Grant type: GRF
Status: Not started
Effective start/end date: 1/01/24 → …