Deep Learning-Based Saliency Detection for 3D Scenes


Student thesis: Doctoral Thesis





Award date: 16 Aug 2021


With the remarkable popularity of 3D entertainment applications and the fast development of network services, there has been tremendous growth of interest in investigating visual saliency for distinguishing conspicuous regions in 3D scenes. This thesis investigates saliency detection for 3D scenes, including stereoscopic video and light field content, based upon advanced deep neural networks. It consists of three main parts: 1) a new deep learning-based attention model for stereoscopic videos is developed by singling out the contributions of spatial-temporal and depth cues; 2) a multi-task collaborative network for light field saliency detection is developed by leveraging the collaborative learning of multiple tasks, including edge detection, depth inference, and salient object detection; 3) a graph neural network-based light field saliency detection method is developed by exploiting multiple angular views as supplementary information. These three parts conduct in-depth research on the saliency detection of 3D scenes, providing useful insights to facilitate future research on comprehensive attention models for 3D scene saliency detection.

In the first part, we devise a saliency detection model for stereoscopic videos that learns to explore saliency through the interactions among spatial, temporal, and depth cues. The model first takes advantage of the specific structure of a 3D residual network (3D-ResNet) to model the saliency driven by spatial-temporal coherence across consecutive frames. Subsequently, the saliency inferred from implicit depth is automatically derived from the displacement correlation between the left and right views by leveraging a deep convolutional network (ConvNet). Finally, a component-wise refinement network is devised to produce the final saliency maps over time by aggregating the saliency distributions obtained from the individual components. To further facilitate research on stereoscopic video saliency, we create a new dataset comprising 175 stereoscopic video sequences with diverse content, together with their dense eye fixation annotations. Extensive experiments show that our proposed model achieves superior performance compared to the state-of-the-art methods on all publicly available eye fixation datasets.
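The final stage described above aggregates per-component saliency distributions into one map. The thesis uses a learned component-wise refinement network for this; the sketch below only illustrates the underlying idea with a fixed weighted fusion of a spatial-temporal map and a depth map, where the function name and the weights `w_st`/`w_d` are hypothetical choices for this example.

```python
import numpy as np

def fuse_saliency(spatial_temporal, depth, w_st=0.7, w_d=0.3):
    """Aggregate per-component saliency maps into a single map.

    Illustrative fixed-weight fusion; the thesis instead learns the
    component-wise aggregation with a refinement network.
    """
    fused = w_st * spatial_temporal + w_d * depth
    # Normalize to [0, 1] so the result is a valid saliency map.
    lo, hi = fused.min(), fused.max()
    return (fused - lo) / (hi - lo + 1e-8)

# Toy example: two 4x4 component saliency maps with random values.
rng = np.random.default_rng(0)
s_st = rng.random((4, 4))   # stands in for the spatial-temporal component
s_d = rng.random((4, 4))    # stands in for the implicit-depth component
fused = fuse_saliency(s_st, s_d)
print(fused.shape)  # (4, 4)
```

A learned refinement network would replace the fixed weights with spatially varying, data-driven combination coefficients.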

In the second part, based on the intrinsic characteristics of light fields, we carefully explore the complementary coherence among multiple cues, including spatial, edge, and depth information, and elaborately design a multi-task collaborative network for light field saliency detection. More specifically, the correlation mechanisms among edge detection, depth inference, and salient object detection are carefully investigated to produce representative saliency features. We first model the coherence among low-level features, heuristic semantic priors, and edge information. Subsequently, depth-oriented saliency features are derived from the geometry of light fields, in which 3D convolution, with its powerful representation capability, is leveraged to model the disparity correlations among multiple viewpoint images. Finally, a feature-enhanced salient object generator is developed to integrate these complementary saliency features, leading to the final salient object predictions for light fields. Quantitative and qualitative experiments demonstrate the superiority of our proposed model against the state-of-the-art methods on the public light field salient object detection datasets.
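To make the 3D-convolution idea concrete: stacking the viewpoint images of a light field into a volume lets a 3D kernel mix information across neighbouring views, which is how disparity correlations can be captured. The minimal sketch below is an assumption-laden illustration (the helper `conv3d_valid`, the 5-view stack, and the random kernel are all invented for this example, not the thesis architecture).

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Minimal 'valid'-mode 3D convolution (illustrative, not optimized)."""
    D, H, W = volume.shape
    d, h, w = kernel.shape
    out = np.zeros((D - d + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # Each output value mixes a 3D neighbourhood that spans
                # several views, hence captures cross-view (disparity) cues.
                out[i, j, k] = np.sum(volume[i:i+d, j:j+h, k:k+w] * kernel)
    return out

# Hypothetical stack of 5 viewpoint images (8x8 each) forms a view volume.
rng = np.random.default_rng(1)
views = rng.random((5, 8, 8))
kernel = rng.random((3, 3, 3))  # one 3x3x3 filter, spanning 3 views
feat = conv3d_valid(views, kernel)
print(feat.shape)  # (3, 6, 6)
```

In a real network this would be a learned `Conv3d` layer with many filters, applied to feature maps rather than raw images.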

In the third part, we propose a light field saliency detection approach that formulates the geometric coherence among the multiple views of a light field as graphs, where the angular and central views represent the nodes and their relations compose the edges. The spatial and disparity correlations between multiple views are effectively explored through multi-scale graph neural networks, enabling a more comprehensive understanding of light field content and the generation of more representative and discriminative saliency features. Moreover, a multi-scale saliency feature consistency learning module is embedded to further enhance the saliency features. Finally, an accurate saliency map is produced for the light field based upon the extracted features. In addition, we establish a new light field saliency detection dataset (CITYU-Lytro) that contains 817 light fields with diverse content and their corresponding annotations, aiming to further promote research on light field saliency detection. Quantitative and qualitative experiments demonstrate that the proposed method performs favorably compared with the state-of-the-art methods on the benchmark datasets.
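The view-graph formulation above can be sketched in miniature: nodes hold per-view features, edges connect the central view to the angular views, and one message-passing step propagates information between them. Everything here is a toy assumption (the star-shaped adjacency, mean aggregation, and the 0.5 mixing coefficient), not the thesis's multi-scale GNN.

```python
import numpy as np

def gnn_layer(node_feats, adj):
    """One mean-aggregation message-passing step (illustrative)."""
    deg = adj.sum(axis=1, keepdims=True)
    # Average the features of each node's neighbours.
    agg = adj @ node_feats / np.maximum(deg, 1)
    # Combine each node's own feature with the aggregated neighbours.
    return 0.5 * (node_feats + agg)

# Toy graph: a central view (node 0) linked to 4 angular views (nodes 1-4).
n_views, dim = 5, 8
adj = np.zeros((n_views, n_views))
adj[0, 1:] = 1          # central view receives from all angular views
adj[1:, 0] = 1          # angular views receive from the central view
rng = np.random.default_rng(2)
feats = rng.random((n_views, dim))   # stand-in per-view feature vectors
updated = gnn_layer(feats, adj)
print(updated.shape)  # (5, 8)
```

Stacking such layers at several feature scales, with learned transformations in place of the fixed averaging, is the general shape of a multi-scale graph network over light field views.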