Learning Interactive and Multi-modal Deep Representation: Applications on Computer Vision, Bio-mechanics, and Chemoinformatics

學習交互和多模態深度表征:在計算機視覺、生物力學和化學信息學中的應用

Student thesis: Doctoral Thesis


Award date: 26 Aug 2024

Abstract

Reasoning about interactions and relations is central to human intelligence and represents a key objective in the field of artificial intelligence (AI). In the era of big data, advances in AI technologies allow us to extract multimodal, high-dimensional data from an extensive array of scenes. This thesis addresses the critical question of how to analyze and study the characteristics of such data, uncover the underlying low-dimensional representations and interaction features, and leverage these findings in various downstream applications, such as robotics, scene analysis, and biomedicine.

From a methodological and modeling perspective, I investigate the observation that neural interaction patterns in biological neural networks can often be abstracted into small-world network models, and I examine whether similar topological interaction patterns can be transferred to artificial neural networks and deep learning (DL) models. Building on this, I designed the community channel network (CC-Net), which embeds the small-world model into convolutional neural networks (CNNs) and thereby improves representation learning efficiently and adaptively.
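The small-world idea behind CC-Net can be illustrated with a Watts-Strogatz-style rewiring over groups (communities) of channels: each group starts connected to its ring neighbours, and a fraction of links are rewired into long-range shortcuts. The sketch below is purely illustrative; the function name, parameters, and the choice to wire at the channel-group level are my assumptions, not the thesis's actual implementation.

```python
import random

def small_world_links(n_groups, k=4, p=0.1, seed=0):
    """Build a Watts-Strogatz-style wiring over channel groups.

    Each of the n_groups channel communities starts connected to its
    k nearest ring neighbours; each link is then rewired to a random
    group with probability p, creating small-world shortcuts. The
    returned edges would decide which groups exchange features
    (e.g. via lightweight 1x1 convolutions).
    """
    rng = random.Random(seed)
    # Regular ring lattice: each group links to k/2 neighbours ahead.
    edges = set()
    for i in range(n_groups):
        for j in range(1, k // 2 + 1):
            edges.add((i, (i + j) % n_groups))
    # Rewire each link with probability p, avoiding self-loops/duplicates.
    rewired = set()
    for (a, b) in edges:
        if rng.random() < p:
            c = rng.randrange(n_groups)
            while c == a or (a, c) in rewired or (c, a) in rewired:
                c = rng.randrange(n_groups)
            rewired.add((a, c))
        else:
            rewired.add((a, b))
    return sorted(rewired)
```

With p = 0 this reduces to a regular ring lattice; as p grows, shortcuts shrink the average path length between groups while local clustering is largely preserved, which is the small-world property the design exploits.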

I extend this research philosophy to similar interaction models in various multimodal contexts, spanning macro- and micro-interaction levels. Macro-interactions are large-scale interactions, primarily involving humans, that occur in real-world scenarios and are explicitly represented as pairs. I studied three settings. (1) Human-robot interaction: I developed a unified framework that combines interaction learning with multimodal learning. It is applied in real-world kitchen scenarios where robots interact with the environment, collecting multimodal data (including sound and video) to discern critical scene information such as the volume of liquid in containers, container types, and food categories, potentially aiding the deployment of home-assistance robots. (2) Human-object interaction (HOI): I studied HOI in natural images, implementing efficient object detection and interaction categorization under temporal shifts in the data distribution. This mitigates model forgetting and can aid downstream tasks such as scene understanding and robotics. (3) Human-sensor interaction: I designed a deep-learning-based temporal method that analyzes a runner's biomechanical parameters and performance level from inertial measurement unit (IMU) signals collected during the run, supporting performance improvement and rehabilitation training for different runners.

Micro-interactions concern the granularity of chemical compounds, occurring at the molecular or feature level within controlled laboratory settings. My research focused on predicting properties of molecular data. I developed a benchmark to test the performance of various pre-trained graph models under out-of-distribution (OOD) scenarios, such as changes in molecular scaffolds, size, and assays. The findings demonstrate the robustness of pre-trained models in OOD scenarios: they outperform specially designed methods (such as disentangled representation learning) and offer a streamlined, effective solution for molecular graph prediction on real-world data.
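A scaffold-based OOD split of the kind the benchmark evaluates can be sketched as follows: molecules are grouped by scaffold and whole scaffold groups are assigned to either train or test, so the two sets share no scaffolds. This is a generic sketch under my own assumptions (scaffolds, e.g. Bemis-Murcko, are assumed precomputed upstream and passed in as strings); it is not the thesis's benchmark code.

```python
from collections import defaultdict

def scaffold_split(smiles_to_scaffold, test_frac=0.2):
    """Split molecules so train and test share no scaffolds.

    smiles_to_scaffold maps each molecule ID (or SMILES) to its
    precomputed scaffold string. Larger scaffold groups are assigned
    to train first, so rarer scaffolds land in the test set, making
    the test distribution out-of-distribution with respect to train.
    """
    # Group molecule IDs by their scaffold.
    groups = defaultdict(list)
    for smi, scaf in smiles_to_scaffold.items():
        groups[scaf].append(smi)
    # Fill train with the largest groups until its quota is reached;
    # everything else (the rarer scaffolds) goes to test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train = len(smiles_to_scaffold) - int(test_frac * len(smiles_to_scaffold))
    train, test = [], []
    for group in ordered:
        if len(train) + len(group) <= n_train:
            train.extend(group)
        else:
            test.extend(group)
    return train, test
```

Because no scaffold crosses the split, a model that merely memorizes scaffold-correlated features will degrade on the test set, which is exactly the robustness property the benchmark probes.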

Looking ahead, this research opens multiple avenues for further exploration and development. The CC-Net’s adaptability and efficiency in enhancing representation learning pave the way for more advanced AI models that closely mimic human neural processing. Applications in human-robot interaction and HOI highlight the potential for more intuitive and responsive AI systems in everyday life, from automated video analysis to advanced robotic assistance. The success in human-sensor interaction analysis and molecular property prediction underscores the versatility of the proposed models in diverse fields, from sports science to pharmaceuticals. The robust performance of pre-trained models in OOD scenarios also suggests a promising future for AI applications in areas where data variability is high. This thesis not only contributes to the understanding of multimodal and interaction representation learning but also sets a foundation for future research that can further integrate these findings into practical, real-world applications.