Learning the Cross-Domain and Multi-Modal Representations with Limited Data

Student thesis: Doctoral Thesis

Award date: 18 Oct 2024

Abstract

Deep learning methods have been developed extensively in recent years, finding applications in various domains such as computer vision and natural language processing. However, the effectiveness of these methods often relies on large, high-quality datasets, which are not always readily accessible in real-world scenarios. This thesis addresses three distinct problems arising from limited data availability in deep learning applications: 1) scenarios where data from different domains become available gradually, requiring models to adapt without forgetting previously learned information; 2) scarcity of annotations in the target domain, necessitating training on datasets with differing distributions and transferring the learned knowledge to the target domain; and 3) situations where data in the original modality are difficult to obtain, requiring the problem to be transferred to a different modality where data can be collected or generated more easily. Through experiments on incremental object classification, hand tracking at dynamic speeds, and indoor object-goal navigation, we propose and evaluate novel approaches to overcoming these data-related challenges, contributing to the development of more adaptable and efficient deep learning models capable of performing well in resource-constrained environments.

We begin by addressing the challenge of lifelong learning. We constructed an indoor object recognition dataset to evaluate the performance of lifelong learning algorithms more comprehensively across different scenarios. Existing lifelong learning datasets focus primarily on class variations while neglecting the domain shift that occurs within objects of the same category. Our dataset, in contrast, incorporates quantifiable domain-specific factors such as lighting conditions, occlusion, object size, and environmental clutter, and thus supports evaluation in three typical scenarios: domain incremental learning, task incremental learning, and class incremental learning. Based on this dataset, we established corresponding evaluation benchmarks and conducted an in-depth analysis of existing lifelong learning algorithms. This work provides valuable insights for the development of lifelong learning models and advances the application of lifelong learning in computer vision.
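To make the three evaluation settings concrete, below is a minimal sketch of the standard incremental-evaluation protocol (train on a stream of tasks sequentially, re-testing every task seen so far after each step); the `fit`/`score` interface and the exact metric definitions are illustrative assumptions, not the benchmark's actual code.

```python
# Minimal sketch of sequential incremental evaluation; the model interface
# (fit/score) and the metric definitions are hypothetical placeholders.
import numpy as np

def evaluate_stream(model, task_stream, test_sets):
    """acc[i, j] = accuracy on test set j after training on tasks 0..i.
    Assumes at least two tasks so that forgetting is well defined."""
    n = len(task_stream)
    acc = np.zeros((n, n))
    for i, task in enumerate(task_stream):
        model.fit(task)                       # hypothetical incremental update
        for j in range(i + 1):
            acc[i, j] = model.score(test_sets[j])
    avg_acc = acc[-1].mean()                  # average accuracy after the final task
    # Forgetting: best accuracy a task ever reached minus its final accuracy.
    forgetting = np.mean([acc[j:-1, j].max() - acc[-1, j] for j in range(n - 1)])
    return avg_acc, forgetting
```

The three scenarios differ in what varies along the stream and what is revealed at test time: domain incremental learning keeps the label set fixed while factors such as lighting or clutter shift, task incremental learning provides the task identity at test time, and class incremental learning does not.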

Secondly, we address the challenge of annotation scarcity in the target domain, specifically for the task of hand tracking at dynamic speeds. Fast motion often blurs the target object in RGB cameras, so we employed event cameras, which offer far higher temporal resolution. However, annotating 3D hand keypoints during rapid motion remains difficult. To solve this problem, we proposed a novel framework that leverages the knowledge of existing RGB-based hand tracking models to guide event-camera-based models in learning hand keypoint recognition during slow motion. Through the proposed data transformation, speed-adaptive segmentation, and event-to-frame representation techniques, we enabled event-camera models trained on slow-motion data to handle fast-motion data effectively, thus resolving the annotation challenge.
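As a rough illustration of how an event stream can be converted into frame-like inputs with speed-adaptive segmentation, the sketch below slices the stream into windows that hold a fixed number of events, so the windows automatically shorten in time when the hand moves faster, and accumulates each window into a two-channel per-pixel count image. The field names and the window size are assumptions; the thesis's actual representation may differ.

```python
# Sketch: fixed-event-count windowing plus event-to-frame accumulation.
# Field names (x, y, p) and the window size are assumptions.
import numpy as np

def events_to_frames(events, height, width, events_per_frame=20_000):
    """events: structured array with int fields x, y and polarity p in {0, 1},
    sorted by timestamp. Returns a list of (2, H, W) count images."""
    frames = []
    for start in range(0, len(events) - events_per_frame + 1, events_per_frame):
        chunk = events[start : start + events_per_frame]
        frame = np.zeros((2, height, width), dtype=np.float32)
        # One channel per polarity; count events landing on each pixel.
        np.add.at(frame, (chunk["p"], chunk["y"], chunk["x"]), 1.0)
        frames.append(frame / frame.max())    # normalize for a CNN input
    return frames
```

Fixing the event count rather than the time span keeps the amount of apparent motion per frame roughly comparable across speeds, which is one plausible mechanism for letting a model trained on slow-motion frames generalize to fast motion.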

Lastly, we address the challenge of limited data availability in the original modality for object-goal navigation tasks. These tasks require robots to navigate to target objects in unknown environments and typically necessitate training in simulators built on indoor 3D scan data. However, the prohibitive cost of acquiring such data has kept dataset sizes small, leading to models with insufficient generalization ability. To address this limitation, we utilized semantic maps as a more accessible data modality for model training and introduced a novel data generation methodology. This approach combines real-world floor plans with semantic maps from existing datasets to automatically generate large amounts of training data that adhere to real-world distributions. Experimental results demonstrate that navigation models trained on this synthetically generated data perform significantly better.
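A minimal sketch of this kind of map synthesis, assuming a binary occupancy floor plan and per-category size and frequency statistics measured from an existing semantic-map dataset; the names, the square object footprints, and the uniform placement rule are illustrative simplifications, not the method's actual generation pipeline.

```python
# Sketch: stamp objects with dataset-derived category statistics onto the
# free space of a real floor plan. All details here are simplifications.
import numpy as np

rng = np.random.default_rng(0)

def synthesize_semantic_map(floor_plan, object_stats, n_objects=30):
    """floor_plan: (H, W) grid with 0 = free space, 1 = wall.
    object_stats: {category_id: (footprint_px, frequency)}, with category
    ids >= 2, measured from an existing semantic-map dataset."""
    sem_map = floor_plan.astype(np.int32).copy()    # 0 = free, 1 = wall
    cats = list(object_stats)
    freqs = np.array([object_stats[c][1] for c in cats], dtype=float)
    freqs /= freqs.sum()
    free_y, free_x = np.nonzero(floor_plan == 0)
    for _ in range(n_objects):
        c = rng.choice(cats, p=freqs)               # category by real frequency
        size = max(1, int(object_stats[c][0]))
        i = rng.integers(len(free_y))               # random free-space anchor
        y, x = free_y[i], free_x[i]
        sem_map[y : y + size, x : x + size] = c     # square footprint stamp
    return sem_map
```

The intent is that layout diversity comes from abundant real floor plans while object co-occurrence statistics stay faithful to real scans; a full generator would additionally respect room types and avoid overlaps.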