Towards Graph-based Learning for Generalizable Object Detection
基於圖學習的可範化目標檢測
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors | |
Award date | 11 Jan 2024 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(74bd66eb-0ea0-4e85-8fe8-95acd4e2b413).html |
---|---|
Abstract
The world is witnessing rapid advances in computer vision. Object detection, one of its most critical and challenging tasks, aims to classify and localize object instances within images, and it underpins a wide range of real-world applications, including autonomous driving and precision medicine. As society progresses, the demand for more intelligent algorithms to improve human lives keeps growing, underscoring the need for automated assistance, such as in driving and medical diagnosis, across diverse real-world scenarios.
While object detection has made significant progress on benchmark datasets, its generalization to real-world scenarios remains a substantial challenge due to the large gap between the idealized conditions studied in research and complex real-world situations. In practice, a detector inevitably encounters diverse environments and unfamiliar objects not covered by the training data, which poses two challenges to reliable detection. First, deployment in a novel scene with an inconsistent data distribution leads to significant performance drops. Second, novel-class objects can confuse the model and lead to incorrect decisions with substantial potential risks.
To ensure reliable real-world generalization, we identify that both challenges, novel scenes and novel classes, can be broadly formulated as a base-to-novel transfer problem. To address it, this thesis proposes a series of graph-based solutions that uncover the inherent relationship between base and novel knowledge, enabling effective base-to-novel transfer and contributing to trustworthy systems that can effectively assist humans.
In the first part, we explore the ideal learning condition with full annotation and propose a relation-aware object detector, Heterogeneous Task Decoupling (HTD), which decouples classification and localization to meet their inconsistent feature requirements. HTD introduces a progressive graph that conducts local-to-global graph reasoning for discriminative semantic learning, together with a border-enhancement mechanism that improves offset prediction with boundary perception. As a result, HTD serves as a powerful generic pipeline and achieves state-of-the-art performance on standard benchmarks.
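To make the decoupling concrete, the following is a minimal, self-contained sketch (not the thesis implementation; the module names, dimensions, and single-round message-passing scheme are illustrative assumptions) of a detection head whose classification branch reasons over a graph of RoI features while the localization branch predicts box offsets separately.

```python
# Minimal sketch (not the thesis code): a decoupled detection head whose
# classification branch reasons over a fully connected graph of RoI features,
# while the localization branch predicts box offsets on its own.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphReasoningLayer(nn.Module):
    """One round of message passing over proposal (RoI) features."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (num_rois, dim)
        sim = x @ x.t() / x.shape[-1] ** 0.5   # pairwise affinity as the graph
        adj = F.softmax(sim, dim=-1)           # soft adjacency (row-normalized)
        return x + F.relu(self.proj(adj @ x))  # aggregate neighbours, residual

class DecoupledHead(nn.Module):
    """Separate branches for classification and localization."""
    def __init__(self, dim=256, num_classes=80):
        super().__init__()
        self.cls_graph = GraphReasoningLayer(dim)
        self.cls_fc = nn.Linear(dim, num_classes)
        self.reg_fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 4))   # box offsets

    def forward(self, roi_feats):
        cls_logits = self.cls_fc(self.cls_graph(roi_feats))
        box_deltas = self.reg_fc(roi_feats)
        return cls_logits, box_deltas

logits, deltas = DecoupledHead()(torch.randn(16, 256))
```

In the actual HTD pipeline the graph reasoning is progressive (local to global) and the regression branch is further enhanced with border perception; the sketch only illustrates the decoupled structure.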
In the second part, we explore adaptation to novel scenes and propose a conditional graph to align distributions with semantic-level consistency. We are the first to empirically investigate the key factor behind the performance drop and observe the critical role of class misalignment. We then model unbiased semantic representations in different domains with a semantic-conditioned graph structure, which is used to adapt the object detector through optimal transport theory.
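As a rough illustration of semantic-conditioned alignment via optimal transport (a generic sketch under simplifying assumptions, not the thesis's conditional graph), one can build per-class prototype nodes in each domain and align them with entropy-regularized Sinkhorn iterations:

```python
# Illustrative sketch: per-class prototype nodes for source and target domains,
# aligned with entropy-regularized optimal transport (Sinkhorn iterations).
import torch
import torch.nn.functional as F

def class_prototypes(feats, labels, num_classes):
    """Average the features of each class into one graph node per class."""
    protos = []
    for c in range(num_classes):
        mask = labels == c
        protos.append(feats[mask].mean(0) if mask.any() else feats.mean(0))
    return F.normalize(torch.stack(protos), dim=-1)

def sinkhorn(cost, eps=0.05, iters=50):
    """Entropy-regularized OT plan between two uniform marginals."""
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.shape[0]) / cost.shape[0]
    v = torch.ones(cost.shape[1]) / cost.shape[1]
    a, b = u.clone(), v.clone()
    for _ in range(iters):
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a[:, None] * K * b[None, :]          # transport plan

src = class_prototypes(torch.randn(100, 128), torch.randint(0, 8, (100,)), 8)
tgt = class_prototypes(torch.randn(100, 128), torch.randint(0, 8, (100,)), 8)
cost = 1 - src @ tgt.t()                        # cosine distance between nodes
plan = sinkhorn(cost)
ot_loss = (plan * cost).sum()                   # alignment loss to minimize
```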
In the third part, we delve further into novel-scene adaptation and propose a graph-matching-based framework for fine-grained adaptation. We first establish a graph by sampling fine-grained pixels in each domain to represent the semantic-level distribution, and then complete the mismatched classes by generating hallucination nodes. Finally, we formulate domain adaptation as a graph-matching problem and align the cross-domain distributions with a matching loss, which enables robust adaptation to novel scenes.
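The idea of completing mismatched classes with hallucination nodes before matching can be sketched as follows (illustrative only; the prototype construction, the hallucination strategy, and the matching loss here are simplified assumptions rather than the thesis's graph-matching formulation):

```python
# Illustrative sketch: complete the class sets of two domain graphs with
# "hallucination" nodes, then use a soft matching loss that encourages node i
# in the source graph to match node i (the same class) in the target graph.
import torch
import torch.nn.functional as F

def complete_graph(protos, present, fallback):
    """Replace prototypes of classes missing in this domain (present[c] == False)
    with hallucinated nodes borrowed from the other domain's prototypes."""
    nodes = torch.where(present[:, None], protos, fallback)
    return F.normalize(nodes, dim=-1)

def matching_loss(src_nodes, tgt_nodes, tau=0.1):
    """Soft node-to-node assignment; the correct match is the same class index."""
    sim = src_nodes @ tgt_nodes.t() / tau
    target = torch.arange(src_nodes.shape[0])
    return F.cross_entropy(sim, target)

C, D = 8, 128
src_protos, tgt_protos = torch.randn(C, D), torch.randn(C, D)
src_present = torch.tensor([True] * C)                # all classes seen in source
tgt_present = torch.tensor([True] * 6 + [False] * 2)  # two classes unseen in target
src = complete_graph(src_protos, src_present, tgt_protos)
tgt = complete_graph(tgt_protos, tgt_present, src_protos)
loss = matching_loss(src, tgt)
```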
In the fourth part, we explore novel-class identification and propose a generative framework with probabilistic graphical modeling, which addresses the base-class overfitting of existing discriminative pipelines. Specifically, we first establish a probabilistic space for object embeddings to enable unbiased learning. We then formulate a continual distribution transfer from object embeddings to text embeddings with denoising diffusion, aligning the visual and textual spaces with novel-class awareness.
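For intuition, a generic conditional denoising-diffusion training step in embedding space might look like the sketch below; it is a simplified stand-in (the noise schedule, the denoiser, and the text conditioning are all assumptions), not the thesis's continual object-to-text transfer:

```python
# Illustrative sketch: a conditional DDPM-style training step in embedding
# space, where the denoiser predicts the noise added to an object embedding
# given a text embedding as condition.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)        # standard DDPM schedule

class Denoiser(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))

    def forward(self, x_t, t, text_emb):
        t_feat = (t.float() / T).unsqueeze(-1)        # scalar timestep feature
        return self.net(torch.cat([x_t, text_emb, t_feat], dim=-1))

def diffusion_loss(model, obj_emb, text_emb):
    t = torch.randint(0, T, (obj_emb.shape[0],))
    eps = torch.randn_like(obj_emb)
    ab = alpha_bar[t].unsqueeze(-1)
    x_t = ab.sqrt() * obj_emb + (1 - ab).sqrt() * eps  # forward (noising) process
    return nn.functional.mse_loss(model(x_t, t, text_emb), eps)

model = Denoiser()
loss = diffusion_loss(model, torch.randn(32, 128), torch.randn(32, 128))
```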
In the fifth part, we explore real-world generalization by considering novel scenes and novel classes together and study the image-classification sub-task with a causal graph. Since numerous spurious correlations hinder unbiased learning, we first establish a structural causal model to analyze the key factors in real-world generalization, and then design a theoretically grounded framework that puts the theory into practice: it applies causal intervention to identify novel classes and introduces decoupled causal alignment to address the cross-domain challenge.
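The causal-intervention ingredient can be illustrated with a toy discrete structural causal model in which a scene/style confounder S influences both the feature X and the label Y; backdoor adjustment removes the spurious path by marginalizing over S. The numbers below are made up purely for illustration and are not thesis results:

```python
# Toy sketch of backdoor adjustment on a small discrete causal graph
# S -> X, S -> Y, X -> Y, where S is a spurious scene/style confounder.
import numpy as np

p_s = np.array([0.7, 0.3])                 # P(S): two scene contexts
# P(Y=1 | X, S): rows indexed by X, columns indexed by S
p_y_given_xs = np.array([[0.2, 0.6],
                         [0.5, 0.9]])

def p_y_do_x(x):
    """Backdoor adjustment: P(Y=1 | do(X=x)) = sum_s P(Y=1 | X=x, S=s) P(S=s)."""
    return float(np.sum(p_y_given_xs[x] * p_s))

print(p_y_do_x(0), p_y_do_x(1))            # interventional effect of X on Y
```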
In the last part, we are the first to formulate and study real-world generalization in object detection and propose a graph-motif-based framework to capture high-order dependencies. Since high-order relations are inherent yet hidden in real-world data, we build a graph to model candidate relations and then select graph motifs, i.e., statistically significant subgraphs, to extract informative high-order patterns. Finally, we propose a unified optimization procedure over task-specific graph motifs to learn under novel classes and scenes.
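A toy version of motif selection by statistical significance is sketched below: count a candidate motif (triangles, as a stand-in) in the relation graph and compare the count against random graphs of the same density via a z-score. The null model and the motif choice are illustrative assumptions, not the thesis's procedure:

```python
# Illustrative sketch: a subgraph pattern counts as a motif if it occurs far
# more often in the relation graph than in random graphs of the same density.
import itertools
import random

def count_triangles(n, edges):
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return sum(1 for a, b, c in itertools.combinations(range(n), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

def motif_zscore(n, edges, trials=200):
    observed = count_triangles(n, edges)
    p = 2 * len(edges) / (n * (n - 1))                 # edge density
    counts = []
    for _ in range(trials):                            # Erdos-Renyi null model
        rand_edges = [e for e in itertools.combinations(range(n), 2)
                      if random.random() < p]
        counts.append(count_triangles(n, rand_edges))
    mean = sum(counts) / trials
    std = (sum((c - mean) ** 2 for c in counts) / trials) ** 0.5
    return (observed - mean) / (std + 1e-8)            # high z-score => motif

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 2)]  # two triangles
print(motif_zscore(5, edges))
```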