Cross-modality Fusion and Progressive Integration Network for Saliency Prediction on Stereoscopic 3D Images

Research output: Journal Publications and Reviews (RGC: 21, 22, 62); 21_Publication in refereed journal; peer-review


  • Yudong Mao
  • Qiuping Jiang
  • Runmin Cong
  • Wei Gao
  • Feng Shao

Original language: English
Journal / Publication: IEEE Transactions on Multimedia
Online published: 19 May 2021
Publication status: Online published - 19 May 2021


Traditional 2D image-based saliency prediction models perform poorly on stereoscopic 3D (S3D) images because eye movements during free viewing of S3D images have been shown to be guided by both RGB and depth features. This paper studies the problem of saliency prediction on S3D images, where the interactions between the RGB and depth modalities are taken into account. Specifically, we design a novel deep neural network, the Cross-modality Fusion and Progressive Integration Network (CFPI-Net), to address this problem. It consists of a Multi-level Cross-modality Feature Fusion (MCFF) module and a Multi-stage Progressive Feature Integration (MPFI) module. The MCFF module first captures hierarchical contextual features from each modality and then fuses the features from the different modalities at each level. The MPFI module comprises multiple cascaded deeply supervised feature integration (DSFI) blocks, in which low-level and high-level cross-modality features are progressively integrated, using the integrated features from the previous stage as guidance. CFPI-Net thus combines multi-level feature representation, cross-modality feature fusion, and multi-stage progressive feature integration, which together improve performance. Experimental results on two benchmark datasets demonstrate that CFPI-Net outperforms state-of-the-art saliency prediction methods both quantitatively and qualitatively. All results and relevant code will be made available to the public.
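
The data flow described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the abstract does not specify the layers, so every operator here is a placeholder assumption. Average pooling stands in for the learned feature extractors, an element-wise mean stands in for cross-modality fusion, and a coarse-to-fine blend stands in for a DSFI block. The coarse-to-fine ordering and all function names (`mcff`, `mpfi`, `cfpi_sketch`, and the helpers) are hypothetical and chosen only for illustration.

```python
# Toy data-flow sketch only. All operators are placeholder assumptions,
# not the layers used in the paper. Feature maps are 2D lists of floats.

def avg_pool2(x):
    """Halve resolution by averaging 2x2 blocks (stand-in for an encoder stage)."""
    h, w = len(x) // 2, len(x[0]) // 2
    return [[(x[2 * i][2 * j] + x[2 * i][2 * j + 1]
              + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w)] for i in range(h)]

def upsample2(x):
    """Nearest-neighbour 2x upsampling."""
    out = []
    for row in x:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def hierarchy(x, levels):
    """Multi-level features for one modality, ordered fine to coarse."""
    feats = [x]
    for _ in range(levels - 1):
        feats.append(avg_pool2(feats[-1]))
    return feats

def fuse(rgb, depth):
    """Placeholder cross-modality fusion: element-wise mean (assumption)."""
    return [[(a + b) / 2.0 for a, b in zip(r, d)] for r, d in zip(rgb, depth)]

def mcff(rgb, depth, levels=3):
    """Extract per-modality hierarchies, then fuse them level by level."""
    return [fuse(r, d)
            for r, d in zip(hierarchy(rgb, levels), hierarchy(depth, levels))]

def integrate(current, guidance):
    """Placeholder DSFI block: blend this level's fused feature with the
    upsampled result of the previous stage, which acts as guidance."""
    up = upsample2(guidance)
    return [[(c + g) / 2.0 for c, g in zip(cr, gr)] for cr, gr in zip(current, up)]

def mpfi(fused):
    """Integrate stage by stage, coarse to fine (ordering is an assumption).
    Every stage output is returned so that each one could receive its own
    loss term, which is the idea behind deep supervision."""
    stage = fused[-1]
    side_outputs = [stage]
    for feat in reversed(fused[:-1]):
        stage = integrate(feat, stage)
        side_outputs.append(stage)
    return side_outputs

def cfpi_sketch(rgb, depth, levels=3):
    """Chain the two modules: fusion first, then progressive integration."""
    return mpfi(mcff(rgb, depth, levels))

# Usage: constant 8x8 inputs standing in for an RGB channel and a depth map.
rgb = [[1.0] * 8 for _ in range(8)]
depth = [[0.0] * 8 for _ in range(8)]
outputs = cfpi_sketch(rgb, depth, levels=3)
```

In this sketch `outputs` holds one map per stage, from the coarsest to the full input resolution, and the last entry plays the role of the final saliency prediction. A real model would replace each placeholder with learned convolutional layers.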

Research Area(s)

  • cross-modality fusion, Decoding, Feature extraction, Fuses, Pipelines, Predictive models, Progressive integration, saliency prediction, Stereoscopic 3D image, Three-dimensional displays, visual attention, Visualization