Abstract
The human visual system employs sophisticated saliency mechanisms to prioritize attention toward informative regions, enabling efficient scene understanding in complex environments. While critical to daily interactions, image interpretation, and human-computer interaction, computational modeling of these processes remains challenging. Conventional methods often oversimplify human behavior or rely on fragile low-level heuristics, limiting their capacity to replicate human attention mechanisms and adapt to real-world scenarios. This thesis bridges this gap by developing computational frameworks that mimic core mechanisms of human visual saliency. We advance two fundamental saliency tasks: Salient Object Detection (SOD) and Salient Object Ranking (SOR).
We first study unsupervised salient object detection (USOD). Previous USOD methods usually rely on low-level saliency priors, such as center and background priors, to detect salient objects, resulting in insufficient high-level semantic understanding. We propose to eliminate the dependency on these fragile low-level priors and instead extract high-level saliency from natural images through a contrastive learning framework. Specifically, we present the Contrastive Saliency Network (CSNet), a prior-free and label-free saliency detector with two novel modules: i) a Contrastive Saliency Extraction (CSE) module that extracts high-level saliency cues by mimicking the human attention mechanism within an instance-discrimination task under a contrastive learning framework, and ii) a Feature Re-Coordinate (FRC) module that recovers spatial details by calibrating high-level features with low-level features in an unsupervised fashion. In addition, we introduce a novel local appearance triplet (LAT) loss that assists training by encouraging similar saliency scores for regions with homogeneous appearances. Extensive experiments show that our approach is effective and outperforms state-of-the-art methods on popular USOD benchmarks.
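To make the LAT idea concrete, the following is a minimal PyTorch sketch of a triplet loss over local pixel appearances. The function names, sampling strategy, window radius, and margin are illustrative assumptions on our part, not the implementation from the thesis: for each anchor pixel, the appearance-similar neighbour serves as the positive and the dissimilar one as the negative, so homogeneous regions are pushed toward similar saliency scores.

```python
import torch
import torch.nn.functional as F

def gather_pixels(t, ys, xs):
    """Gather pixel vectors at integer coordinates.
    t: (B, C, H, W); ys, xs: (B, N) -> returns (B, C, N)."""
    B, C, H, W = t.shape
    idx = (ys * W + xs).unsqueeze(1).expand(-1, C, -1)
    return t.flatten(2).gather(2, idx)

def lat_loss(image, saliency, num_triplets=256, radius=5, margin=0.5):
    """Illustrative LAT-style loss (our assumption, not the thesis code).
    image: (B, 3, H, W) RGB in [0, 1]; saliency: (B, 1, H, W) in [0, 1]."""
    B, _, H, W = image.shape
    dev = image.device
    # Random anchors kept `radius` pixels away from the image border.
    ya = torch.randint(radius, H - radius, (B, num_triplets), device=dev)
    xa = torch.randint(radius, W - radius, (B, num_triplets), device=dev)
    # Two random neighbours inside the local window around each anchor.
    dy1, dx1, dy2, dx2 = (torch.randint(-radius, radius + 1,
                          (B, num_triplets), device=dev) for _ in range(4))
    c_a = gather_pixels(image, ya, xa)
    c_1 = gather_pixels(image, ya + dy1, xa + dx1)
    c_2 = gather_pixels(image, ya + dy2, xa + dx2)
    s_a = gather_pixels(saliency, ya, xa).squeeze(1)
    s_1 = gather_pixels(saliency, ya + dy1, xa + dx1).squeeze(1)
    s_2 = gather_pixels(saliency, ya + dy2, xa + dx2).squeeze(1)
    # The neighbour with the smaller colour distance acts as the positive.
    pos_is_1 = ((c_a - c_1).pow(2).sum(1) < (c_a - c_2).pow(2).sum(1)).float()
    s_pos = pos_is_1 * s_1 + (1 - pos_is_1) * s_2
    s_neg = pos_is_1 * s_2 + (1 - pos_is_1) * s_1
    # Saliency should match the similar neighbour more than the dissimilar one.
    return F.relu((s_a - s_pos).abs() - (s_a - s_neg).abs() + margin).mean()
```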
We then study salient object ranking, a task that predicts the sequential order of human attention shifts among objects in a scene. Existing SOR methods primarily rank all scene objects simultaneously by exploring their spatial and semantic properties. However, ranking all salient objects at once does not align with human viewing behavior and may result in incorrect attention-shift predictions. We observe that humans view a scene through a sequential and continuous process: a cycle of foveating to objects of interest with foveal vision while using peripheral vision to prepare for the next fixation location. Based on this observation, we propose to model the dynamic interplay between foveal and peripheral vision to predict human attention shifts sequentially. To this end, we propose a novel SOR model, SeqRank, which reproduces foveal vision to extract high-acuity visual features for accurate salient instance segmentation, while also modeling peripheral vision to select the object that is likely to grab the viewer's attention next. By incorporating both types of vision, our model better mimics human viewing behavior and provides a more faithful ranking among scene objects. Extensive experiments show that our model achieves superior SOR performance.
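The foveal/peripheral cycle can be sketched as a simple sequential selection loop. The code below is a toy PyTorch illustration, assuming per-object features are already available (e.g., from an instance-segmentation head); the GRU history and MLP scorer merely stand in for SeqRank's foveal and peripheral components and are not its actual architecture.

```python
import torch
import torch.nn as nn

class SequentialRanker(nn.Module):
    """Pick objects one at a time: a recurrent state plays the role of
    accumulated foveal fixations, and an MLP scorer plays the role of
    peripheral vision proposing the next fixation (illustrative only)."""

    def __init__(self, feat_dim=256, hid_dim=256):
        super().__init__()
        self.foveal = nn.GRUCell(feat_dim, hid_dim)      # fixation history
        self.peripheral = nn.Sequential(                 # next-fixation scorer
            nn.Linear(feat_dim + hid_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, 1))

    def forward(self, obj_feats):
        # obj_feats: (N, feat_dim) features of N candidate objects.
        N = obj_feats.size(0)
        h = obj_feats.new_zeros(1, self.foveal.hidden_size)
        visited = torch.zeros(N, dtype=torch.bool, device=obj_feats.device)
        order = []
        for _ in range(N):
            # Peripheral pass: score every unattended object given history.
            ctx = h.expand(N, -1)
            scores = self.peripheral(torch.cat([obj_feats, ctx], 1)).squeeze(1)
            nxt = int(scores.masked_fill(visited, float('-inf')).argmax())
            order.append(nxt)
            visited[nxt] = True
            # Foveal pass: "fixate" the chosen object, updating the history.
            h = self.foveal(obj_feats[nxt:nxt + 1], h)
        return order  # indices in predicted attention-shift order

ranker = SequentialRanker()
print(ranker(torch.randn(5, 256)))  # e.g. [3, 0, 4, 1, 2]
```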
We finally study how human pose reflexively guides attention shifts in scenes involving complex human interactions. We observe that observers' attention can be reflexively guided by the poses and gestures of the people in the scene, which indicate their activities. For example, observers tend to shift their attention to follow others' head orientation or running/walking direction in order to anticipate what will happen next. Inspired by this observation, we propose to exploit human poses to understand high-level interactions between human participants and their surroundings for robust salient object ranking. Specifically, we propose PoseSOR, a human pose-aware SOR model, with two novel modules: 1) a Pose-Aware Interaction (PAI) module that integrates human pose knowledge into salient object queries to learn high-level interactions, and 2) a Pose-Driven Ranking (PDR) module that applies pose knowledge as directional cues to help predict where human attention will shift. To our knowledge, our approach is the first to explore human pose for salient object ranking. Extensive experiments demonstrate the effectiveness of our method, particularly in complex scenes, and our model sets a new state of the art on SOR benchmarks.
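As an illustration of how pose can act as a directional cue, which is the intuition behind the PDR module, here is a minimal PyTorch sketch that biases object scores toward the direction a person faces. All names, the cosine-alignment prior, and the mixing weight are our own assumptions for exposition, not the module's actual design.

```python
import torch

def pose_direction_prior(person_xy, head_dir, obj_xy, sharpness=4.0):
    """Score candidate objects by how well they lie along a person's
    head/body direction, e.g. derived from detected pose keypoints.
    person_xy: (2,) person location; head_dir: (2,) unit direction;
    obj_xy: (N, 2) object centers. Returns a (N,) prior summing to 1."""
    rel = obj_xy - person_xy                              # person -> object vectors
    rel = rel / rel.norm(dim=1, keepdim=True).clamp(min=1e-6)
    cos = (rel * head_dir).sum(dim=1)                     # alignment in [-1, 1]
    return torch.softmax(sharpness * cos, dim=0)          # peaks along the gaze line

# Example: combine the directional prior with base saliency scores.
person = torch.tensor([0.5, 0.5])
direction = torch.tensor([1.0, 0.0])                      # person faces right
objects = torch.tensor([[0.9, 0.5], [0.1, 0.5], [0.5, 0.9]])
base_scores = torch.tensor([0.2, 0.8, 0.5])
ranked = base_scores + 0.5 * pose_direction_prior(person, direction, objects)
print(ranked.argsort(descending=True))                    # attention-shift order
```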
| Date of Award | 11 Sept 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Rynson W H LAU (Supervisor) |