Salient Object Detection from Multi-modal Data


Student thesis: Doctoral Thesis



Award date: 28 Aug 2019


The increasing availability of sensors beyond optical cameras (e.g., infrared sensors and depth cameras) allows traditional RGB-based computer vision systems to perceive a wider range of scene properties, achieve better performance, and operate in scenarios where RGB alone is insufficient. Meanwhile, the development of deep learning enables computer vision systems to understand scenes at a deeper level. This thesis focuses on using deep learning techniques to co-infer salient objects from multi-modal data, such as RGB-depth and RGB-thermal image pairs. Previous solutions to this task mainly follow two paradigms: (a) hand-crafting multi-modal features with prior knowledge, which is nontrivial and does not generalize well across contexts; and (b) inferring saliency from each modality separately and then fusing the results with straightforward combination schemes. In both cases, cross-modal complements are not well integrated into better representations.

In this thesis, we approach the problem of multi-modal salient object detection from a systematic view: modal-specific representation learning, complementary cue selection, and cross-modal complement fusion. Following this guidance, we leverage deep neural networks as tools and contribute carefully designed methods for learning, selecting, and fusing cross-modal, cross-level representations, including (1) a discriminative distillation transfer method, which learns better representations of new modalities by using inexpensive modality labels; (2) a hierarchical cross-modal distillation method, which introduces supervisory signals from source data to effectively guide the optimization of the model for unlabeled target data; (3) a multi-scale multi-path fusion network, which supplies adaptive and flexible fusion flows, thereby easing optimization and enabling sufficient and efficient multi-modal fusion; (4) a progressively complementarity-aware fusion network (PCA-Net), in which a residual function is adopted to model cross-modal complements; (5) a three-stream attention-aware fusion network (TA-Net), in which a cross-modal distillation stream is crafted to favor multi-modal fusion in the bottom-up process and an attention mechanism is introduced for selective combination of cross-modal, cross-level features; and (6) a densely cross-level feedback topology, which enjoys rich multi-scale multi-modal representations, selective combinations, and informative collaborations across modalities and levels.
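The residual idea behind design (4) can be illustrated with a minimal sketch: the depth branch predicts only the complement that the RGB features lack, and this complement is added back to the RGB stream. This is a toy NumPy version with a hypothetical single linear layer standing in for the learned fusion block; the actual PCA-Net architecture differs.

```python
import numpy as np

def complementarity_aware_fusion(rgb_feat, depth_feat, w, b):
    """Residual cross-modal fusion (toy sketch): the fused output is the
    RGB feature map plus a predicted cross-modal complement.

    rgb_feat, depth_feat: (H, W, C) feature maps
    w: (2C, C) weights of a hypothetical 1x1 fusion layer; b: (C,) bias
    """
    # Concatenate modalities along channels and predict the complement.
    joint = np.concatenate([rgb_feat, depth_feat], axis=-1)  # (H, W, 2C)
    complement = np.tanh(joint @ w + b)                      # (H, W, C)
    # Residual connection: RGB features plus the predicted complement.
    return rgb_feat + complement

rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 8, 16))    # toy H x W x C feature map
depth = rng.standard_normal((8, 8, 16))
w = rng.standard_normal((32, 16)) * 0.1
b = np.zeros(16)

fused = complementarity_aware_fusion(rgb, depth, w, b)
print(fused.shape)  # (8, 8, 16)
```

A useful property of this formulation is that a zero-weight fusion layer reduces to an identity mapping over the RGB stream, so the depth branch can only add information, which is what makes the residual parameterization easy to optimize.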

Comprehensive experiments verify that the designs in this thesis successfully extract rich modal-specific representations from the new modality with limited labeled data, are attentive to complementary cross-modal representations, and combine multi-scale cross-modal features well for sufficient multi-modal fusion.
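The "selective combination" used by the attention-aware designs can be sketched as a per-channel softmax gate over the two modalities, derived from global average pooling. This is a common attention pattern, not the exact gating used in TA-Net; all names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attended_fusion(rgb_feat, depth_feat):
    """Selectively combine two (H, W, C) modal feature maps with
    per-channel attention weights that sum to 1 across the modalities."""
    # Global average pooling gives one descriptor per modality: (2, C).
    stats = np.stack([rgb_feat.mean(axis=(0, 1)),
                      depth_feat.mean(axis=(0, 1))])
    # Softmax over the modality axis: per-channel selection weights.
    weights = softmax(stats, axis=0)
    return weights[0] * rgb_feat + weights[1] * depth_feat

rng = np.random.default_rng(1)
rgb = rng.standard_normal((4, 4, 8))
depth = rng.standard_normal((4, 4, 8))
print(attended_fusion(rgb, depth).shape)  # (4, 4, 8)
```

Because the weights are normalized per channel, each channel of the output is a convex combination of the two modal features, so the gate can emphasize whichever modality is more informative for that channel without changing the feature scale.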

Research areas

  • RGB-D, cross-modal transfer, multi-modal fusion, salient object detection