Auxiliary Knowledge-based Learning for Visual Saliency, Shadow Detection and Reflection Removal


Student thesis: Doctoral Thesis

Award date: 4 Sep 2020


When performing a specific vision task, humans tend to draw on auxiliary knowledge to aid the task. For example, humans may perceive semantics for visual saliency, utilize distractors to distinguish targets, and integrate visual cues from dynamic sequences to address static scene understanding problems. Similarly, to empower computers with the ability to understand vision tasks, such auxiliary knowledge should also be explored. In this thesis, we investigate several approaches to extracting and utilizing useful auxiliary knowledge for image understanding, focusing on three common visual tasks: visual saliency, shadow detection and reflection removal.

We first study the important problem of predicting visual saliency, focusing in particular on webpages. To model human attention on webpages, we present an end-to-end learning framework that utilizes auxiliary knowledge of semantic layout for predicting task-driven visual saliency. Given a webpage, we propose a convolutional neural network to predict where people would typically look at it under different task conditions. We observe that, given a specific task, human attention is strongly correlated with certain semantic components on a webpage (e.g., images, buttons and input boxes). Inspired by this observation, we design our network to explicitly disentangle saliency prediction into two independent sub-tasks: task-specific attention shift prediction and task-free saliency prediction. While the task-specific branch estimates task-driven attention shift over the webpage from its semantic components, the task-free branch infers visual saliency induced by the visual features of the webpage. The outputs of the two branches are then combined to produce the final prediction. This task decomposition allows us to efficiently learn our model from a small-scale task-driven saliency dataset with sparse labels (captured under a single task condition). Experimental results show that our method outperforms the baselines and prior works, achieving state-of-the-art performance on a newly collected benchmark dataset for task-driven webpage saliency detection.
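The two-branch design above can be pictured with a toy sketch. Everything here is an illustrative assumption rather than the thesis's actual model: the trained convolutional branches are replaced by fixed toy maps, and the convex-combination fusion rule (weight `alpha`) is a hypothetical stand-in for the learned combination of the two branch outputs.

```python
import numpy as np

def attention_norm(x):
    """Normalize a raw map into an attention distribution (softmax over pixels)."""
    e = np.exp(x - x.max())
    return e / e.sum()

def combine_saliency(task_free, task_shift, alpha=0.5):
    """Hypothetical fusion: a convex combination of the task-free saliency
    map and the task-driven attention-shift map, renormalized to sum to 1.
    (The actual network learns how to combine the two branch outputs.)"""
    fused = (1 - alpha) * attention_norm(task_free) + alpha * attention_norm(task_shift)
    return fused / fused.sum()

# Toy 4x4 "webpage": the task-free branch responds to a visually salient
# image region (top-left); the task-specific branch shifts attention toward
# a semantic component such as a button (bottom-right).
task_free = np.zeros((4, 4)); task_free[0, 0] = 3.0
task_shift = np.zeros((4, 4)); task_shift[3, 3] = 3.0
final = combine_saliency(task_free, task_shift)
```

In this sketch both the visually salient region and the task-relevant component end up with high probability mass in `final`, which is the qualitative behavior the two-branch decomposition is meant to produce.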

Beyond the webpage medium, we also study natural images. Shadow detection is an important and challenging task for scene understanding. In this work, we study shadow detection with the auxiliary information of distractions. Despite promising results from recent deep learning-based methods, existing works still struggle with ambiguous cases where the visual appearances of shadow and non-shadow regions are similar (referred to as distraction in our context). We propose a Distraction-aware Shadow Detection Network (DSDNet) that explicitly learns and integrates the semantics of visual distraction regions in an end-to-end framework. At the core of our framework is a novel standalone, differentiable Distraction-aware Shadow (DS) module, which allows us to learn distraction-aware, discriminative features for robust shadow detection by explicitly predicting false positives and false negatives. We conduct extensive experiments on three public shadow detection datasets, SBU, UCF and ISTD, to evaluate our method. Experimental results demonstrate that our model boosts shadow detection performance by effectively suppressing false positives and false negatives, achieving state-of-the-art performance.
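One way to picture the role of the predicted false-positive and false-negative maps is as a correction applied to a base shadow map. The additive refinement rule below is a hypothetical simplification for illustration; in the DS module itself, the distraction semantics are integrated into the learned features rather than applied as a post-hoc arithmetic fix.

```python
import numpy as np

def distraction_aware_refine(base, fp_map, fn_map):
    """Hypothetical refinement: suppress regions predicted to be false
    positives (e.g., dark non-shadow distractors), recover regions
    predicted to be false negatives (e.g., bright shadows), then clip
    back to a valid probability map."""
    return np.clip(base - fp_map + fn_map, 0.0, 1.0)

base = np.array([[0.9, 0.8],    # raw shadow prediction
                 [0.1, 0.2]])
fp   = np.array([[0.0, 0.7],    # a dark non-shadow region wrongly scored high
                 [0.0, 0.0]])
fn   = np.array([[0.0, 0.0],    # a bright shadow region wrongly scored low
                 [0.0, 0.6]])
refined = distraction_aware_refine(base, fp, fn)
```

After refinement, the distractor pixel's shadow score drops while the missed shadow pixel's score rises, mirroring how the DS module is meant to suppress both kinds of error.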

Finally, we study an image restoration problem, single-image reflection removal, with auxiliary information from multi-view images. Single-image reflection removal aims to restore the transmitted image given a single image shot through a window or glass. Existing methods mainly rely on information extracted from a single image along with some pre-defined priors. We observe that humans would change their viewpoints and watch how the content changes (due to the different layer dynamics of the transmitted and reflected contents) to differentiate the transmitted content from the reflected content. Inspired by this observation, we learn a representation of layer dynamics from multi-view images and transfer the learned knowledge to single-image reflection removal. In particular, we propose a teacher-student framework in which a teacher network watching multi-view images teaches a student network to remove reflection from a single input image. In addition, to address the difficulty of constructing a multi-view dataset, we propose to synthesize training images via a translation-based approach, which is able to mimic the layer dynamics caused by small camera movements. Extensive experiments show that our model achieves state-of-the-art performance.
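The two ingredients of this approach, translation-based multi-view synthesis and teacher-student distillation, can be illustrated with a toy sketch. The blending weights, the simple horizontal pixel shift, and the loss weighting `lam` are all illustrative assumptions and not the thesis's actual formulation.

```python
import numpy as np

def synthesize_view(transmission, reflection, shift):
    """Hypothetical translation-based synthesis: a small camera motion
    displaces the reflected layer by `shift` pixels relative to the
    transmitted layer before the two are blended."""
    moved = np.roll(reflection, shift, axis=1)
    return np.clip(0.8 * transmission + 0.2 * moved, 0.0, 1.0)

def distill_loss(student_feat, teacher_feat, student_out, target, lam=0.5):
    """Hypothetical distillation objective: reconstruction error on the
    transmitted image plus a penalty for deviating from the features of
    the multi-view teacher."""
    rec = np.mean((student_out - target) ** 2)
    mimic = np.mean((student_feat - teacher_feat) ** 2)
    return rec + lam * mimic

# Toy scene: uniform transmitted layer, one bright reflected stripe.
# Across synthesized views, the stripe moves while the transmitted
# content stays put, exposing the layer dynamics the teacher observes.
T = np.full((4, 4), 0.6)
R = np.zeros((4, 4)); R[:, 0] = 1.0
view0 = synthesize_view(T, R, 0)
view1 = synthesize_view(T, R, 1)
```

A student matching its teacher's features and the target transmitted image would drive this toy loss to zero; any feature mismatch adds a distillation penalty on top of the reconstruction error.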