Human Attention Mechanisms-inspired Learning for Visual Saliency Perception
Student thesis: Doctoral Thesis
Detail(s)
Award date | 28 Nov 2023 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(20e67fb0-47e9-4a7d-ac9f-713fc63797f3).html |
Abstract
Visual saliency perception is a crucial cognitive ability that allows humans to quickly detect and prioritize visually distinct objects. One of the key factors that affect this ability is the physical characteristics of the objects themselves, including their color, contrast, brightness, and semantics, among others. However, human attention mechanisms, such as subitizing, object-based and spatial attentions, and overt and covert attentions, also heavily influence this ability. This thesis aims to investigate the impact of these attention mechanisms on visual saliency perception by employing deep learning techniques to model them in two key tasks: salient instance detection (SID) and saliency ranking (SR).
First, we delve into the role of subitizing in the SID problem. Standard saliency detection methods under weak supervision rely on class labels for object localization. However, relying on class labels to distinguish different salient instances with high semantic affinities can be challenging. We note that subitizing can help separate instances of the same class and group different parts of the same instance by providing a rapid judgment on the number of salient instances. We therefore propose a Weakly-supervised SID Network (WSID-Net) with three branches: a Saliency Detection Branch for locating salient regions using class consistency information; a Boundary Detection Branch for delineating object boundaries through class discrepancy information; and a Centroid Detection Branch for detecting salient instance centroids with subitizing information. These branches complement each other to produce salient instance maps.
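The abstract does not specify the implementation, so the following PyTorch-style sketch is only a minimal illustration of the three-branch layout it describes: a shared backbone feeding a saliency head, a boundary head, and a centroid head whose peak count would reflect the subitized number of instances. All layer sizes, names, and design choices below are placeholder assumptions, not the thesis's actual WSID-Net.

```python
import torch
import torch.nn as nn

class WSIDNetSketch(nn.Module):
    """Illustrative three-branch layout; every layer here is a stand-in."""
    def __init__(self, in_channels=3, feat_channels=64):
        super().__init__()
        # Shared backbone producing a feature map (stand-in for a real encoder).
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Saliency Detection Branch: per-pixel saliency map.
        self.saliency_head = nn.Conv2d(feat_channels, 1, 1)
        # Boundary Detection Branch: per-pixel instance-boundary map.
        self.boundary_head = nn.Conv2d(feat_channels, 1, 1)
        # Centroid Detection Branch: centroid heatmap whose peaks give the
        # subitized count and seed points of salient instances.
        self.centroid_head = nn.Conv2d(feat_channels, 1, 1)

    def forward(self, x):
        feats = self.backbone(x)
        return {
            "saliency": torch.sigmoid(self.saliency_head(feats)),
            "boundary": torch.sigmoid(self.boundary_head(feats)),
            "centroid": torch.sigmoid(self.centroid_head(feats)),
        }

if __name__ == "__main__":
    outputs = WSIDNetSketch()(torch.randn(1, 3, 224, 224))
    print({k: tuple(v.shape) for k, v in outputs.items()})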
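```

In such a layout, the instance maps would plausibly be assembled by splitting the saliency map along the predicted boundaries and grouping regions around the centroid peaks, which is the complementary behavior the three branches are described as providing.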
Second, we exploit the roles of object-based and spatial attentions in the SR problem. Existing methods model only the object-based attention mechanism, learning either object-object or object-scene relations, which tend to assign high saliency degrees to objects with strong semantics (e.g., humans). We observe that the human spatial attention mechanism, which moves, engages, and disengages from region to region (i.e., from context to context), provides region-level contextual interactions beyond object-level reasoning. Hence, we propose a Bi-Directional Network (BD-Net) to unify spatial and object-based attentions with two novel modules: (1) a Selective Object Saliency (SOS) module that models object-based attention by adjusting the semantic representations of salient objects, and (2) an Object-Context-Object Relation (OCOR) module that models spatial attention by building second-order object-context interactions.
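As a rough, hypothetical sketch of how two such modules could compose (the actual SOS and OCOR designs are defined in the thesis, not here), SOS is drawn below as a per-object channel gate that modulates each object's semantic representation, and OCOR as two chained cross-attention steps, object-to-context then context-to-object, which yields a second-order object-context-object interaction. The feature dimensions and the use of `nn.MultiheadAttention` are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SOSSketch(nn.Module):
    """Object-based attention: re-weight each object's semantic channels (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, obj):               # obj: (B, N, D) per-object features
        return obj * self.gate(obj)       # channel-wise modulation per object

class OCORSketch(nn.Module):
    """Spatial attention: object -> context -> object, a second-order interaction (assumed design)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.obj_to_ctx = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ctx_to_obj = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, obj, ctx):          # ctx: (B, HW, D) region/context features
        # First order: each context region gathers information from the objects.
        ctx_enriched, _ = self.obj_to_ctx(ctx, obj, obj)
        # Second order: each object reads back from the object-aware context.
        obj_enriched, _ = self.ctx_to_obj(obj, ctx_enriched, ctx_enriched)
        return obj_enriched

if __name__ == "__main__":
    obj, ctx = torch.randn(2, 5, 128), torch.randn(2, 196, 128)
    ranked_feats = OCORSketch(128)(SOSSketch(128)(obj), ctx)
    print(ranked_feats.shape)             # (2, 5, 128), to be fed to a ranking head
```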
Third, we examine the significance of overt and covert attentions in the SR problem. Existing methods learn object-oriented context information, which models only overt attention. Psychological studies reveal that overt attention occurs after eye fixation and yields a specific perception of an object, whereas covert attention occurs beforehand and drives the overt eye movements with an ensemble perception of the scene. To this end, we propose a novel Covert-to-Overt PErception Network (COPE-Net) to exploit both overt and covert attention mechanisms with three components: (1) a Covert Contextual Perception (CCP) module to mimic the global viewing process of covert attention; (2) a Conditional Saliency Scaling (CSS) module to imitate the covert-to-overt transition process and overt attention in a fine-grained way; and (3) a Progressive Alignment (PA) head to unify the overt and covert attention mechanisms and regularize their learning consistency with the global saliency in a multi-stage manner.
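A minimal sketch of such a covert-to-overt pipeline, under assumed designs that do not come from the thesis: CCP is approximated by attention-pooling all object features into one global scene vector (the "ensemble percept"), CSS rescales per-object saliency scores conditioned on that vector, and the PA head is drawn as a few refinement stages whose per-stage outputs would each be supervised for consistency during training.

```python
import torch
import torch.nn as nn

class COPENetSketch(nn.Module):
    """Covert-to-overt pipeline sketch; all module designs here are assumptions."""
    def __init__(self, dim=128, stages=3):
        super().__init__()
        # CCP (assumed): covert attention as a global, pre-fixation percept of the
        # scene, obtained by attention-pooling all object features into one vector.
        self.ccp_query = nn.Parameter(torch.randn(1, 1, dim))
        self.ccp_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # CSS (assumed): overt, object-level saliency scores rescaled by a gate
        # conditioned on the covert scene percept.
        self.base_score = nn.Linear(dim, 1)
        self.css_gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, 1), nn.Sigmoid()
        )
        # PA head (assumed): small refiners applied stage by stage; training would
        # add a per-stage loss to keep the rankings consistent with global saliency.
        self.pa_refine = nn.ModuleList([nn.Linear(dim + 1, 1) for _ in range(stages)])

    def forward(self, obj):                            # obj: (B, N, D) per-object features
        B, N, _ = obj.shape
        scene, _ = self.ccp_attn(self.ccp_query.expand(B, -1, -1), obj, obj)  # (B, 1, D)
        scores = self.base_score(obj)                                          # (B, N, 1)
        scale = self.css_gate(torch.cat([obj, scene.expand(-1, N, -1)], dim=-1))
        scores = scores * scale                        # covert-conditioned scaling
        stage_scores = []
        for refine in self.pa_refine:                  # progressive, stage-wise refinement
            scores = scores + refine(torch.cat([obj, scores], dim=-1))
            stage_scores.append(scores)
        return scores, stage_scores                    # final ranks + per-stage outputs

if __name__ == "__main__":
    final, per_stage = COPENetSketch()(torch.randn(2, 6, 128))
    print(final.squeeze(-1))                           # higher score = higher saliency rank
```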
We have conducted extensive experiments to show that the proposed WSID-Net, BD-Net, and COPE-Net outperform state-of-the-art approaches and carefully designed baselines on popular SID and SR benchmarks.