3D Point Cloud Learning with Imperfect and Limited Data


Student thesis: Doctoral Thesis


Award date: 29 Aug 2023

Abstract

3D perception has the potential to revolutionize the way machines perceive the world and communicate as humans do. In recent years, significant advances in 3D perception have enabled a variety of applications such as autonomous vehicles, robot navigation, medical imaging, question answering, and querying in augmented and virtual reality. However, learning robust 3D perception and vision algorithms remains challenging for several reasons. One major challenge is the acquisition of 3D data and annotations, a complex process that requires expertise. The quality of 3D data and annotations remains imperfect, introducing errors and inaccuracies into the learning process. Moreover, the quantity of available data is often limited, making it difficult to train robust models that generalize well to new and unseen scenarios. In this thesis, we propose techniques that improve 3D vision learning, enabling robust and effective learning from limited and imperfect 3D data.

First, imperfect 3D data quality can cause significant performance degradation in downstream tasks. Due to the limitations of 3D sensing technology, 3D point clouds are typically sparse, non-uniform, density-imbalanced, lacking in detail, and of low quality. To alleviate this problem, we propose Meta-PU, the first method to support continuous point cloud upsampling at arbitrary scales. Recent research on point cloud upsampling has achieved great success thanks to the development of deep learning, but existing methods treat upsampling at different scale factors as independent tasks, and therefore must train a separate model for each scale factor, which is inefficient and impractical for storage and computation in real applications. To address this limitation, Meta-PU supports point cloud upsampling at arbitrary scale factors with a single model. Besides a backbone network consisting of residual graph convolution (RGC) blocks, a meta-subnetwork is learned to adjust the weights of the RGC blocks dynamically, and a farthest point sampling block is adopted to sample different numbers of points. Together, these two components enable Meta-PU to continuously upsample point clouds at arbitrary scale factors using only a single model. Moreover, our experiments show that training on multiple scales simultaneously is mutually beneficial, resulting in even better performance than existing methods trained for a single specific scale factor.
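To illustrate the idea only, the following minimal sketch (hypothetical module names, shapes, and layer sizes; not the actual Meta-PU implementation) shows how a meta-subnetwork can condition graph-convolution weights on the scale factor, and how farthest point sampling can trim a densely upsampled cloud to exactly round(r × N) points for an arbitrary, possibly non-integer scale r.

```python
# Hypothetical simplification of the Meta-PU idea, not the thesis code.
import torch
import torch.nn as nn


class MetaGraphConv(nn.Module):
    """Graph convolution whose weights are predicted from the scale factor r."""

    def __init__(self, in_dim, out_dim, k=8):
        super().__init__()
        self.in_dim, self.out_dim, self.k = in_dim, out_dim, k
        # Meta-subnetwork: scalar scale factor r -> flattened conv weights.
        self.meta = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(),
            nn.Linear(64, in_dim * out_dim),
        )

    def forward(self, feats, xyz, r):
        # feats: (N, in_dim) point features, xyz: (N, 3) coordinates, r: scalar tensor.
        w = self.meta(r.view(1, 1)).view(self.in_dim, self.out_dim)
        # kNN aggregation (max over neighbors) as a stand-in for an RGC block.
        dist = torch.cdist(xyz, xyz)                       # (N, N) pairwise distances
        idx = dist.topk(self.k, largest=False).indices     # (N, k) nearest neighbors
        neigh = feats[idx]                                 # (N, k, in_dim)
        agg = neigh.max(dim=1).values                      # (N, in_dim)
        return torch.relu(agg @ w)                         # (N, out_dim)


def farthest_point_sampling(xyz, m):
    """Greedy FPS: pick m points that spread evenly over the cloud."""
    picked = [0]
    d = torch.cdist(xyz[0:1], xyz).squeeze(0)
    for _ in range(m - 1):
        nxt = int(d.argmax())
        picked.append(nxt)
        d = torch.minimum(d, torch.cdist(xyz[nxt:nxt + 1], xyz).squeeze(0))
    return xyz[picked]


if __name__ == "__main__":
    xyz = torch.rand(128, 3)                     # sparse input cloud
    feats = torch.rand(128, 32)
    conv = MetaGraphConv(32, 64)
    r = torch.tensor(2.5)                        # arbitrary (non-integer) scale factor
    out = conv(feats, xyz, r)                    # (128, 64) features conditioned on r
    dense = torch.rand(512, 3)                   # stand-in for a max-scale backbone output
    target = farthest_point_sampling(dense, round(2.5 * 128))  # (320, 3)
```

Because only the small meta-subnetwork depends on r, a single set of backbone parameters serves every scale factor, which is the property that makes one model sufficient.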

Second, there is a pressing need to address noisy annotations on 3D data, which result from the complex annotation process. We lead the effort in investigating and resolving this noisy-label issue on 3D data. Based on our observations of manual annotation in real-world scenarios and popular datasets, we find that label noise on 3D data is unknown, spatially variant, and heavy at both the instance and boundary levels. To address this issue, we propose a Point Noise-Adaptive Learning (PNAL) framework. Compared with noise-robust methods for image tasks, our framework is noise-rate blind, so it can cope with the spatially variant noise rates specific to point clouds. Specifically, we propose a point-wise confidence selection that obtains reliable labels from the historical predictions of each point, and a cluster-wise label correction that uses a voting strategy to generate the best possible label by considering neighbor correlations. To handle boundary-level label noise, we also propose a variant, "PNAL-boundary", with a progressive boundary label cleaning strategy. Extensive experiments demonstrate the framework's effectiveness on both synthetic and real-world noisy datasets, even with 60% symmetric noise and high levels of boundary noise.
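The following sketch conveys the intuition behind point-wise confidence selection and cluster-wise voting in a hypothetical, simplified form (function names, the stability threshold, and data layout are assumptions for illustration, not the PNAL implementation): points whose recent predictions are stable are treated as reliable, and each cluster replaces its labels with the majority vote of its reliable points, without ever needing an estimate of the noise rate.

```python
# Hypothetical simplification of the PNAL intuition, not the thesis code.
import numpy as np


def reliable_points(pred_history, stability=0.9):
    """pred_history: (T, N) predicted class ids over the last T epochs.
    A point is reliable if one class dominates its recent predictions."""
    T, N = pred_history.shape
    reliable = np.zeros(N, dtype=bool)
    voted = np.zeros(N, dtype=np.int64)
    for i in range(N):
        classes, counts = np.unique(pred_history[:, i], return_counts=True)
        j = counts.argmax()
        voted[i] = classes[j]                  # most frequent historical prediction
        reliable[i] = counts[j] / T >= stability
    return reliable, voted


def cluster_label_correction(labels, cluster_ids, reliable, voted):
    """Replace each cluster's labels with the majority vote of its reliable points."""
    corrected = labels.copy()
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        rel = mask & reliable
        if rel.any():
            vals, counts = np.unique(voted[rel], return_counts=True)
            corrected[mask] = vals[counts.argmax()]
    return corrected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    history = rng.integers(0, 3, size=(10, 100))   # 10 epochs of predictions, 100 points
    noisy_labels = rng.integers(0, 3, size=100)
    clusters = rng.integers(0, 5, size=100)         # geometric clusters of points
    rel, vote = reliable_points(history, stability=0.8)
    cleaned = cluster_label_correction(noisy_labels, clusters, rel, vote)
```

Since the selection is based on prediction stability rather than on an assumed noise level, the same procedure applies whether a region is lightly or heavily corrupted, which is what "noise-rate blind" refers to.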

Third, the major difficulty in 3D-and-language perception is the lack of available data and annotations. To address this problem, we present the first attempt at 3D question answering (3DQA), together with the first large-scale, fully human-annotated dataset of natural-language questions and free-form answers in 3D environments. We also use several visualizations and experiments to investigate the remarkable diversity of the collected questions and the significant differences between this task and both 2D VQA and 3D captioning. Unlike 2D image VQA, 3DQA takes a colored point cloud as input and requires both appearance and 3D geometric comprehension to answer 3D-related questions. To this end, we propose a novel transformer-based 3DQA framework, "3DQA-TR", which consists of two encoders that exploit the appearance and geometry information, respectively. Finally, the multi-modal information about appearance, geometry, and the linguistic question attends jointly via a 3D-Linguistic BERT to predict the target answers. Extensive experiments on this dataset demonstrate the clear superiority of our proposed 3DQA framework over existing VQA frameworks and the effectiveness of our major designs.
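As an illustration of the two-encoder design only, the sketch below (hypothetical class names, dimensions, and a plain transformer encoder standing in for the 3D-Linguistic BERT; not the actual 3DQA-TR code) shows how per-point appearance and geometry features can be fused with question tokens in one attention stack and classified into a fixed answer set.

```python
# Hypothetical simplification of a two-stream 3DQA model, not the thesis code.
import torch
import torch.nn as nn


class TwoStream3DQA(nn.Module):
    def __init__(self, d=256, num_answers=1000, vocab=30522):
        super().__init__()
        # Appearance encoder consumes per-point RGB; geometry encoder consumes xyz.
        self.appearance_enc = nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))
        self.geometry_enc = nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))
        self.word_emb = nn.Embedding(vocab, d)
        # Stand-in for the 3D-Linguistic BERT: a shared transformer over all tokens.
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        self.cls = nn.Linear(d, num_answers)

    def forward(self, rgb, xyz, question_ids):
        # rgb, xyz: (B, N, 3); question_ids: (B, L) token ids.
        app = self.appearance_enc(rgb)                 # (B, N, d) appearance features
        geo = self.geometry_enc(xyz)                   # (B, N, d) geometry features
        txt = self.word_emb(question_ids)              # (B, L, d) question embeddings
        tokens = torch.cat([app + geo, txt], dim=1)    # joint point-language sequence
        fused = self.fusion(tokens)                    # cross-modal attention
        return self.cls(fused.mean(dim=1))             # (B, num_answers) logits


if __name__ == "__main__":
    model = TwoStream3DQA()
    rgb = torch.rand(2, 1024, 3)
    xyz = torch.rand(2, 1024, 3)
    q = torch.randint(0, 30522, (2, 16))
    logits = model(rgb, xyz, q)                        # (2, 1000) answer logits
```

Keeping appearance and geometry in separate encoders lets each stream specialize before fusion, which is the design choice the abstract attributes to 3DQA-TR.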