Salient Object Detection from Multi-modal Data


Student thesis: Doctoral Thesis

Award date: 28 Aug 2019

Abstract

The increasing availability of various sensors (e.g., infrared sensors, depth cameras, and optical cameras) has given traditional RGB-based computer vision systems the ability to perceive a wider range of scene information, achieve better performance, and operate in scenarios where RGB data alone is insufficient. Meanwhile, the development of deep learning enables computer vision systems to understand scenes at a deeper semantic level. This thesis focuses on using deep learning techniques to jointly infer the salient object or objects from multi-modal data, such as RGB-depth and RGB-thermal image pairs. Previous solutions to this task mainly follow two paradigms: (a) crafting multi-modal features with prior knowledge, which is nontrivial and generalizes poorly across contexts; and (b) inferring saliency from each modality separately and then solving the multi-modal fusion problem with straightforward combination schemes, as sketched below. In either case, the cross-modal complements are not well integrated into better representations.
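As a point of reference for paradigm (b), the following is a minimal sketch (in Python/NumPy; an illustration of the general idea, not any specific published method): each modality produces its own saliency map and the maps are merged with a fixed rule, with no learned interaction between the modalities. The function name and the choice of rules are assumptions made for illustration.

# Illustrative sketch only: "straightforward combination" late fusion of
# per-modality saliency maps, with a fixed, non-learned merging rule.
import numpy as np

def late_fusion(rgb_saliency: np.ndarray, depth_saliency: np.ndarray,
                rule: str = "average") -> np.ndarray:
    """Combine two per-modality saliency maps (values in [0, 1], same shape)."""
    if rule == "average":
        return (rgb_saliency + depth_saliency) / 2.0
    if rule == "max":
        return np.maximum(rgb_saliency, depth_saliency)
    raise ValueError(f"unknown rule: {rule}")

Because the merging rule is fixed in advance, such schemes cannot adapt to cases where one modality is unreliable, which motivates the learned fusion designs described next.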

In this thesis, we approach the problem of multi-modal salient object detection from a systematic view: modal-specific representation learning, complementary cue selection, and cross-modal complement fusion. Following this guidance, we leverage deep neural networks and contribute careful designs for learning, selecting, and fusing cross-modal cross-level representations, including (1) a discriminative distillation transfer method, which learns the new modality better by using inexpensive modality labels; (2) a hierarchical cross-modal distillation method, which introduces supervisory signals from source data to effectively guide the optimization of the model on unlabeled target data; (3) a multi-scale multi-path fusion network, which supplies adaptive and flexible fusion flows, thereby easing optimization and enabling sufficient and efficient multi-modal fusion; (4) a progressively complementarity-aware fusion network (PCA-Net), in which a residual function is adopted to model cross-modal complements; (5) a three-stream attention-aware fusion network (TA-Net), where a cross-modal distillation stream is crafted to favor multi-modal fusion in the bottom-up process and an attention mechanism is introduced for selective combination of cross-modal cross-level features; and (6) a densely cross-level feedback topology, which enjoys rich multi-scale multi-modal representations, selective combinations, and informative collaborations across modalities and levels.
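To make these fusion ideas concrete, the following is a minimal sketch in Python/PyTorch (an assumption; not the thesis code) of a single fusion block combining two ingredients from designs (4) and (5): the cross-modal complement is modeled as a residual added to the RGB feature, and a learned attention gate selects how much of that complement to keep. The class name ComplementaryFusionBlock and all layer choices are illustrative assumptions, not the architectures of PCA-Net or TA-Net.

# Minimal sketch (assumed PyTorch, not the thesis code): a fusion block that
# (a) models the cross-modal complement as a residual added to the RGB feature
# and (b) gates that complement with a learned channel attention.
import torch
import torch.nn as nn


class ComplementaryFusionBlock(nn.Module):
    """Fuse an RGB feature map with a depth/thermal feature map of the same shape."""

    def __init__(self, channels: int):
        super().__init__()
        # Predicts the cross-modal complement (residual) from both modalities.
        self.complement = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Channel attention deciding how much of the complement to keep.
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        # Residual formulation: fused = rgb + attention-weighted complement.
        residual = self.complement(torch.cat([rgb_feat, aux_feat], dim=1))
        gate = self.attention(residual)
        return rgb_feat + gate * residual


# Usage on dummy multi-modal features (batch of 2, 64 channels, 32x32 maps).
if __name__ == "__main__":
    block = ComplementaryFusionBlock(channels=64)
    rgb = torch.randn(2, 64, 32, 32)
    depth = torch.randn(2, 64, 32, 32)
    fused = block(rgb, depth)
    print(fused.shape)  # torch.Size([2, 64, 32, 32])

In this sketch the residual form lets the network fall back to the RGB feature when the auxiliary modality adds nothing useful, while the attention gate provides the selective combination of cross-modal features that the designs above emphasize.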

Comprehensive experiments verify that the designs in this thesis successfully extract rich modal-specific representations from the new modality with limited labeled data, are attentive to complementary cross-modal representations, and combine multi-scale cross-modal features well for sufficient multi-modal fusion.

Research areas

  • RGB-D, cross-modal transfer, multi-modal fusion, salient object detection