Abstract
Point cloud analysis has become an essential procedure in autonomous driving, robotics, and other computer vision and graphics applications. To achieve favorable performance, many learning-based methods have been proposed and have made steady progress. However, such algorithms require large-scale and accurately annotated datasets, whose construction is expensive and time-consuming, especially in the 3D domain. Additionally, complex and diverse open-world environments contain many ‘unseen’ objects or classes, i.e., categories never defined or trained on by already deployed 3D systems, which may cause these pre-trained systems to fail in diverse ways. Existing works attempt to address these issues with a series of zero-shot and few-shot algorithms, but these methods either achieve limited performance or suffer from low time efficiency. In this thesis, we address these difficulties and advance the zero-shot and few-shot analysis of point clouds with effective and efficient approaches.

This thesis is organized into four parts. In the first part, we propose to transfer a versatile 2D pre-trained model into a powerful 3D zero-shot learner. Powerful 2D models pre-trained on large-scale multimodal datasets, e.g., Contrastive Language-Image Pre-training (CLIP), have exhibited strong open-world recognition capacity on natural images. In this thesis, we apply CLIP to 3D problems. First, we bridge the 3D-2D modality gap by projecting irregular point clouds into realistic depth maps via a novel projection method. Then, we adopt large language models (LLMs) to generate 3D prompts that contain vivid descriptions of 3D shapes. Textual features extracted from our 3D prompts thus exhibit strong similarity to the projected depth maps, largely preserving and transferring the prior image-text knowledge of CLIP to the 3D domain. Our CLIP-based 3D zero-shot framework can be applied to classification, segmentation, and object detection tasks, and can also be tailored for few-shot classification. Without ‘seeing’ any 3D training data, this approach significantly outperforms existing works on zero-shot classification and segmentation.
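To make the projection-then-matching pipeline concrete, the following sketch pairs a toy orthographic depth projection with the public OpenAI CLIP image and text encoders. The projection routine, prompt template, and class names are illustrative assumptions rather than the thesis implementation, and CLIP's official image preprocessing is skipped for brevity.

```python
# Minimal sketch: zero-shot point cloud classification by projecting points to
# a depth map and matching it against text prompts with the openai/clip package.
import torch
import clip


def project_to_depth_map(points, resolution=224):
    """Orthographically project an (N, 3) point cloud onto the xy-plane,
    keeping the nearest depth per pixel, and return a 3-channel image."""
    pts = points - points.mean(dim=0)                     # center the cloud
    pts = pts / pts.abs().max()                           # scale to [-1, 1]
    uv = ((pts[:, :2] + 1) / 2 * (resolution - 1)).long()
    depth = torch.zeros(resolution, resolution)
    z = (pts[:, 2] + 1) / 2                               # depth in [0, 1]
    for (u, v), d in zip(uv.tolist(), z.tolist()):        # nearest point wins
        depth[v, u] = max(depth[v, u].item(), d)
    return depth.unsqueeze(0).repeat(3, 1, 1)             # replicate as RGB


device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["airplane", "chair", "lamp"]               # placeholder classes
# In the thesis, prompts are generated by an LLM; a fixed template is used here.
prompts = clip.tokenize([f"a depth map of a {c}" for c in class_names]).to(device)

points = torch.rand(1024, 3)                              # dummy point cloud
image = project_to_depth_map(points).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(prompts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_feat @ txt_feat.t()              # cosine similarities
print("predicted class:", class_names[logits.argmax(-1).item()])
```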
In the second part, we continue to explore the potential of CLIP in 3D tasks and propose an efficient few-shot classification model. We design an adaptive prior refinement strategy to refine the prior knowledge of CLIP and filter out redundant information. After refinement, the representation contains informative 3D features that promote the alignment between visual and textual representations. Meanwhile, we exploit the trilateral relationship among the test sample, the training-set images, and the textual prompts to calibrate the final prediction. The proposed model not only improves 3D few-shot performance but can also be applied to 2D image classification, achieving state-of-the-art (SOTA) accuracy.
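As a rough illustration of the refine-then-calibrate idea, the sketch below keeps only high-variance CLIP feature channels as a crude stand-in for prior refinement, then blends text-to-test and support-to-test similarities into a single prediction. The channel-selection heuristic, blending weights, and dummy features are assumptions, not the thesis method.

```python
# Minimal sketch: few-shot classification that refines CLIP features and
# combines text and support-image similarities for the final prediction.
import torch


def refine_channels(support_feats, support_labels, num_classes, k=256):
    """Select the k feature channels with the largest inter-class variance."""
    class_means = torch.stack([
        support_feats[support_labels == c].mean(0) for c in range(num_classes)
    ])                                            # (C, D) per-class prototypes
    scores = class_means.var(dim=0)               # channel-wise variance
    return scores.topk(k).indices                 # informative channel indices


def predict(test_feat, text_feats, support_feats, support_labels,
            num_classes, alpha=1.0, beta=1.0):
    idx = refine_channels(support_feats, support_labels, num_classes)
    t, s, q = text_feats[:, idx], support_feats[:, idx], test_feat[idx]
    t = t / t.norm(dim=-1, keepdim=True)
    s = s / s.norm(dim=-1, keepdim=True)
    q = q / q.norm()

    text_logits = q @ t.t()                               # test <-> text prompts
    one_hot = torch.nn.functional.one_hot(support_labels, num_classes).float()
    cache_logits = (q @ s.t()) @ one_hot                  # test <-> support images
    return alpha * text_logits + beta * cache_logits      # calibrated prediction


# Dummy data: 512-d CLIP features, 3 classes, 4 shots per class.
D, C, K = 512, 3, 4
text_feats = torch.randn(C, D)
support_feats = torch.randn(C * K, D)
support_labels = torch.arange(C).repeat_interleave(K)
test_feat = torch.randn(D)
print(predict(test_feat, text_feats, support_feats, support_labels, C))
```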
Although the few-shot model proposed above achieves satisfactory results, its training step is time-consuming. Therefore, in the third part, we further design a training-free few-shot classification and segmentation framework to reduce time and resource consumption. The proposed framework adopts positional encodings to represent the absolute and relative position information of points. We then stack hand-crafted filters that project raw point clouds into an embedding space by considering the frequency spectrum of the point representations. The whole process embeds point clouds into dense or global representations without introducing any learnable parameters, so it serves as a training-free encoder for feature extraction. After encoding, we perform classification or segmentation by similarity matching. This training-free property simplifies the few-shot pipeline with minimal resource consumption and mitigates the domain gap between training and test categories. The proposed training-free model not only benefits few-shot tasks but also achieves satisfactory accuracy on fully supervised tasks. On top of this, we build a training-based variant that further boosts performance through efficient training, bridging the support-query domain gap with self-correlation and cross-correlation calibration.
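The following sketch conveys the flavor of a parameter-free encoder: trigonometric positional encodings pooled into a global descriptor, followed by few-shot prediction via cosine similarity to support prototypes. The frequencies, pooling choices, and dimensions are placeholders, and the thesis's hand-crafted frequency-domain filters are not reproduced here.

```python
# Minimal sketch: a training-free point cloud encoder and similarity-matching
# few-shot classifier with no learnable parameters.
import torch


def positional_encode(points, num_freqs=6):
    """Map (N, 3) coordinates to a (N, 6 * num_freqs) embedding using
    sinusoids of increasing frequency (no learnable parameters)."""
    freqs = 2.0 ** torch.arange(num_freqs)                # 1, 2, 4, ...
    angles = points.unsqueeze(-1) * freqs                 # (N, 3, F)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1) # (N, 3, 2F)
    return emb.flatten(1)


def encode_cloud(points):
    """Global descriptor: concatenate max- and mean-pooled point embeddings."""
    emb = positional_encode(points)
    return torch.cat([emb.max(0).values, emb.mean(0)])


def few_shot_classify(query_cloud, support_clouds, support_labels, num_classes):
    q = encode_cloud(query_cloud)
    s = torch.stack([encode_cloud(c) for c in support_clouds])
    protos = torch.stack([s[support_labels == c].mean(0) for c in range(num_classes)])
    sims = torch.nn.functional.cosine_similarity(q.unsqueeze(0), protos)
    return sims.argmax().item()


# Dummy 3-way 2-shot episode with random clouds of 1024 points.
support = [torch.rand(1024, 3) for _ in range(6)]
labels = torch.tensor([0, 0, 1, 1, 2, 2])
query = torch.rand(1024, 3)
print("predicted class:", few_shot_classify(query, support, labels, 3))
```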
Lastly, we turn our attention to the robustness of point cloud models. Most existing algorithms focus on improving accuracy on clean benchmarks while ignoring robustness, leaving models potentially vulnerable in complex and noisy environments. Based on this observation, we aim to improve the robustness of models against common out-of-distribution (OOD) corruptions. To this end, we design an adversarial diffusion autoencoder to generatively pre-train the 3D backbone. We argue that the diffusion process applied to points can be regarded as a form of data augmentation that forces the model to learn robust and general features. In summary, the key design of this part is to extract point cloud representations from the Gaussianized (noise-perturbed) points and then use them to reconstruct the original point cloud. This process is carried out by the proposed adversarial diffusion autoencoder, which extracts informative and robust representations from point clouds. Different from existing autoencoders, the proposed one repeatedly applies the encoder to the reconstructed point clouds. We minimize the coding-rate difference between real samples and reconstructed point clouds via a rate reduction loss, which improves generative ability. Our experiments verify the robustness and generalization ability of the autoencoder-based model, which also achieves satisfactory few-shot performance.
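To show how the diffusion process can double as augmentation, the sketch below applies the standard forward-diffusion step to a point cloud and trains a toy encoder/decoder pair to reconstruct the clean points from the Gaussianized ones under a Chamfer loss. The adversarial and rate-reduction components described above are omitted, and the networks and noise schedule are placeholder assumptions.

```python
# Minimal sketch: Gaussianize points via forward diffusion, then reconstruct
# the clean cloud with a toy autoencoder as a generative pre-training step.
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


def diffuse(points, t):
    """q(x_t | x_0): scale the clean points and add Gaussian noise."""
    a = alphas_cumprod[t]
    return a.sqrt() * points + (1 - a).sqrt() * torch.randn_like(points)


def chamfer(a, b):
    """Symmetric Chamfer distance between two (N, 3) point sets."""
    d = torch.cdist(a, b)                                  # (N, N) pairwise distances
    return d.min(1).values.mean() + d.min(0).values.mean()


encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 256))
decoder = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 1024 * 3))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), 1e-3)

clean = torch.rand(1024, 3)                                # dummy point cloud
t = torch.randint(0, T, (1,)).item()
noised = diffuse(clean, t)                                 # diffusion as augmentation

latent = encoder(noised).max(0).values                     # global feature (256,)
recon = decoder(latent).view(1024, 3)                      # reconstructed cloud
loss = chamfer(recon, clean)
loss.backward()
opt.step()
print("reconstruction loss:", loss.item())
```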
Taken together, effective and efficient zero-shot and few-shot analysis frameworks are progressively designed for 3D point clouds, achieving a better balance between performance and time overhead. We also explore models’ robustness against corruptions in the 3D domain.
| Date of Award | 30 Aug 2024 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Kede MA (Supervisor) |