Computational Prior-Based Learning for Scene Understanding

Student thesis: Doctoral Thesis

Detail(s)

Supervisors/Advisors
  • Rynson W H LAU (Supervisor)
  • Lizhuang Ma (External person) (External Supervisor)
Award date: 23 Jun 2022

Abstract

Humans rely heavily on prior knowledge for scene understanding, and such priors help achieve reliable understanding across diverse scenes. Teaching these priors to vision systems is therefore essential for improving their performance. However, representing priors computationally is challenging, because they are abstract. In this thesis, we analyze different priors that are useful for scene understanding and propose computational formulations of them for three tasks: salient object detection, night-time scene parsing, and mirror detection.

We first study the salient object detection problem based on the subitizing prior. Salient object detection aims to detect the most visually distinct objects in an image and produce the corresponding masks. As pixel-level annotations are costly, image tags are often used as weak supervision. However, an image tag can only annotate one class of objects. In this thesis, we introduce saliency subitizing (the number of salient objects) as the weak supervision, since it is class-agnostic. This aligns the supervision with the nature of saliency detection, where the salient objects in an image may come from more than one class. To this end, we propose a model with two modules: a Saliency Subitizing Module (SSM) and a Saliency Updating Module (SUM). The SSM learns to generate the initial saliency masks from the subitizing information, without requiring unsupervised methods or random seeds, while the SUM iteratively refines the generated masks. Extensive experiments on five benchmark datasets show that our method outperforms other weakly-supervised methods and even performs comparably to some fully-supervised methods.
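As a rough illustration, the two-module design might be sketched in PyTorch as follows. The module names (SSM, SUM) come from the abstract; the backbone stand-in, layer sizes, and the CAM-style way of reading an initial mask out of the subitizing classifier are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch, assuming a CAM-style subitizing classifier; not the
# thesis architecture.
import torch
import torch.nn as nn

class SaliencySubitizingModule(nn.Module):
    """Predicts the number of salient objects (subitizing) and exposes a
    class-agnostic activation map as the initial saliency mask."""
    def __init__(self, in_channels=512, num_counts=5):
        super().__init__()
        # Stand-in for a pretrained backbone (assumption).
        self.features = nn.Sequential(
            nn.Conv2d(3, in_channels, 3, padding=1), nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(in_channels, num_counts, 1)

    def forward(self, x):
        feat = self.features(x)
        logits_map = self.classifier(feat)           # per-count activation maps
        count_logits = logits_map.mean(dim=(2, 3))   # global pooling -> subitizing logits
        # CAM-style initial mask: activation map of the predicted count bin.
        idx = count_logits.argmax(dim=1)
        init_mask = logits_map[torch.arange(x.size(0)), idx].unsqueeze(1)
        return count_logits, torch.sigmoid(init_mask)

class SaliencyUpdatingModule(nn.Module):
    """Iteratively refines a coarse saliency mask conditioned on the image."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1))

    def forward(self, image, mask, steps=3):
        for _ in range(steps):  # iterative refinement, step count is an assumption
            mask = torch.sigmoid(self.refine(torch.cat([image, mask], dim=1)))
        return mask
```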

We then study the night-time scene parsing problem based on the exposure prior. Although great progress has been made in scene parsing in recent years, most existing works assume that input images are taken in the day-time under good lighting conditions. In this thesis, we aim to address the night-time scene parsing (NTSP) problem, which has two main challenges: 1) labeled night-time data are scarce, and 2) over- and under-exposure may co-occur in night-time images and are not explicitly modeled in existing pipelines. To tackle the scarcity of night-time data, we collect a new labeled dataset, named NightCity, containing 4,297 real night-time images with ground-truth pixel-level semantic annotations. To our knowledge, NightCity is the largest dataset for NTSP. In addition, we propose an exposure-aware framework that addresses the NTSP problem by augmenting the segmentation process with explicitly learned exposure features. Extensive experiments show that training on NightCity significantly improves NTSP performance and that our exposure-aware model outperforms state-of-the-art methods, yielding top performance on both our dataset and existing datasets.
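The exposure-aware idea can be sketched as follows: a shared encoder feeds both an auxiliary per-pixel exposure head and a segmentation head, with the exposure features fused back into the parsing branch. The encoder, the three-way exposure prediction, and the fusion layer are illustrative assumptions about how explicitly learned exposure features might augment segmentation, not the thesis framework itself.

```python
# A minimal sketch, assuming a shared encoder and a simple fusion scheme.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExposureAwareSegNet(nn.Module):
    def __init__(self, num_classes=19, width=64):
        super().__init__()
        # Stand-in for a pretrained segmentation backbone (assumption).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, width, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # Auxiliary head: per-pixel under-/well-/over-exposure prediction
        # (the exposure supervision signal is an assumption in this sketch).
        self.exposure_head = nn.Conv2d(width, 3, 1)
        self.fuse = nn.Conv2d(width + 3, width, 1)
        self.seg_head = nn.Conv2d(width, num_classes, 1)

    def forward(self, x):
        feat = self.encoder(x)
        exposure = self.exposure_head(feat)  # explicitly learned exposure features
        fused = F.relu(self.fuse(torch.cat([feat, exposure], dim=1)))
        seg = self.seg_head(fused)
        # Upsample both outputs to the input resolution.
        seg = F.interpolate(seg, size=x.shape[2:], mode='bilinear', align_corners=False)
        exposure = F.interpolate(exposure, size=x.shape[2:], mode='bilinear', align_corners=False)
        return seg, exposure
```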

Finally, we study the mirror detection problem based on the visual chirality prior. Mirror detection is challenging because the visual appearance of a mirror depends on that of its surroundings. Existing mirror detection methods mainly rely on contextual contrast and relational similarity between mirror and non-mirror regions, and may therefore fail to identify a mirror when these assumptions are violated. Inspired by a recent study that applies a CNN to distinguish whether an image is flipped based on the visual chirality property, in this thesis we rethink this image-level property and reformulate it as a learnable pixel-level cue for mirror detection. Specifically, we first propose a novel flipping-convolution-flipping (FCF) transformation that models visual chirality as a learnable commutative residual. We then propose a novel visual chirality embedding (VCE) module that exploits this commutative residual in multi-scale feature maps to embed the visual chirality features into our mirror detection model. In addition, we propose a visual chirality-guided edge detection (CED) module that integrates the visual chirality features with contextual features for detection refinement. Extensive experiments show that the proposed method outperforms state-of-the-art methods on three benchmark datasets.
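The FCF transformation admits a compact sketch: if convolution commuted with horizontal flipping, the residual below would vanish, so a learned nonzero residual carries pixel-level chirality cues. The single-layer form and kernel size are simplifying assumptions; the thesis applies the idea across multi-scale feature maps.

```python
# A minimal sketch of the flipping-convolution-flipping (FCF) residual.
import torch
import torch.nn as nn

class FCFResidual(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        straight = self.conv(x)  # conv(x)
        # flip -> conv -> flip back (flip along the width axis of NCHW input)
        flipped = torch.flip(self.conv(torch.flip(x, dims=[3])), dims=[3])
        return straight - flipped  # commutative residual: zero iff conv and flip commute

# usage: r = FCFResidual(64)(feature_map)  # large where features are chiral
```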