Exploiting Spatial-Temporal Cues for Visual Data Enhancement

Student thesis: Doctoral Thesis

Abstract

Visual data is critical in diverse applications, such as media sharing, robotics, and augmented reality. However, practical constraints, including the limitations of capture devices and low-light conditions, inevitably degrade visual data, compromising its quality for both human perception and downstream tasks. Low-light image/video enhancement aims to recover normal-light visual content from the corresponding degraded low-light data. This task is challenging because the problem is inherently ambiguous: multiple normal-light versions may correspond to a single low-light input. This work proposes a comprehensive framework to address the problem. The key components include: 1) introducing a controllable map that allows users to adjust the enhancement results according to their specific requirements, meeting the needs of diverse application scenarios; 2) leveraging deep neural networks to effectively capture the temporal dependencies between video frames and spatial visual features, which is crucial for handling low-light visual data; and 3) designing an adaptive enhancement scheme based on local lighting conditions and temporal consistency, preserving local details while maintaining overall visual coherence.

The first contribution enlightens low-light images with dynamic spatial guidance for context enrichment. Images captured under low-light conditions often suffer from a myriad of visual quality degradations, such as poor visibility, diminished contrast, and excessive noise. These complex degradations vary across regions (e.g., noise in smooth areas, overexposure in well-lit regions, and low contrast around edges) and pose significant challenges for low-light image enhancement. To address this issue, we introduce a novel methodology that leverages a learnable guidance map derived from signal-level and deep priors, allowing the deep neural network to adaptively enhance low-light images in a region-dependent manner. Furthermore, we exploit the enhancement capability of the learnable guidance map through multi-scale dilated context collaboration, enabling the model to extract contextually enriched feature representations across various receptive fields. By assimilating the intrinsic perceptual information from the learned guidance map, our approach generates richer and more realistic textures, effectively mitigating the adverse effects of low-light conditions.
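
As a rough illustration of this idea, the sketch below shows how a learned per-pixel guidance map could gate features gathered by parallel dilated-convolution branches with growing receptive fields. The module name, channel widths, and fusion scheme are illustrative assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class GuidedDilatedBlock(nn.Module):
    """Illustrative sketch: a guidance map gates multi-scale dilated features.

    All names and hyperparameters are assumptions for exposition; the thesis's
    actual guidance-map derivation and fusion scheme may differ.
    """
    def __init__(self, channels=32, dilations=(1, 2, 4)):
        super().__init__()
        # Parallel branches with growing receptive fields (multi-scale context).
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        # Predict a per-pixel guidance map from the input features
        # (standing in for the signal-level and deep priors described above).
        self.guidance = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),  # values in [0, 1]: region-dependent enhancement strength
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, feat):
        g = self.guidance(feat)                        # region-adaptive gate
        ctx = torch.cat([b(feat) for b in self.branches], dim=1)
        return feat + g * self.fuse(ctx)               # residual, guided update


x = torch.randn(1, 32, 64, 64)
print(GuidedDilatedBlock()(x).shape)  # torch.Size([1, 32, 64, 64])
```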

The second contribution investigates temporally consistent enhancement of low-light videos via spatial-temporal compatible learning. Temporal inconsistency is a notorious artifact that frequently plagues low-light video enhancement techniques. Regrettably, current methods often overlook the significance of leveraging both data-centric clues and model-centric design to tackle this problem effectively. In this context, our research undertakes a comprehensive exploration from three key angles. First, to enrich scene diversity and motion flexibility, we construct a diverse synthetic paired low/normal-light video dataset using a carefully designed low-light simulation strategy, which effectively complements existing real-captured datasets. Second, for better exploitation of temporal dependencies, we develop a Temporally Consistent Enhancer Network (TCE-Net), whose architecture stacks 3D and 2D convolutions to exploit spatial-temporal clues within videos. Finally, temporal dynamic feature dependencies are exploited to derive consistency constraints across frame indices. All these efforts are powered by a Spatial-Temporal Compatible Learning (STCL) optimization technique, which dynamically constructs training loss functions adapted to different datasets. In this way, multi-frame information can be effectively utilized and different levels of information from the network can be feasibly integrated, expanding synergies across various data types and yielding visually superior results in illumination distribution, color consistency, texture details, and temporal coherence. Extensive experimental results on diverse real-world low-light video datasets clearly demonstrate the proposed method's superior performance compared to state-of-the-art techniques.
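
To make the 3D+2D design concrete, here is a minimal sketch of how stacked 3D convolutions (temporal aggregation) and 2D convolutions (per-frame spatial refinement) can jointly process a video clip. The layer counts, channel widths, and exact interleaving are assumptions for exposition, not the published TCE-Net specification.

```python
import torch
import torch.nn as nn

class TCEBlockSketch(nn.Module):
    """Minimal sketch of a 3D+2D convolutional enhancer for video clips.

    Input/output layout: (batch, channels, frames, height, width). The
    structure is an assumption; the thesis's TCE-Net may differ in depth,
    widths, and how the 3D and 2D stages are interleaved.
    """
    def __init__(self, in_ch=3, feat=16):
        super().__init__()
        # 3D convolutions aggregate temporal context across neighboring frames.
        self.temporal = nn.Sequential(
            nn.Conv3d(in_ch, feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat, feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # 2D convolutions refine per-frame spatial detail.
        self.spatial = nn.Sequential(
            nn.Conv2d(feat, feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat, in_ch, kernel_size=3, padding=1),
        )

    def forward(self, clip):                  # clip: (B, C, T, H, W)
        f = self.temporal(clip)               # temporal clues via 3D convs
        b, c, t, h, w = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        out = self.spatial(f)                 # per-frame spatial refinement
        out = out.reshape(b, t, -1, h, w).permute(0, 2, 1, 3, 4)
        return clip + out                     # residual enhancement


clip = torch.randn(1, 3, 5, 64, 64)           # 5-frame low-light clip
print(TCEBlockSketch()(clip).shape)            # torch.Size([1, 3, 5, 64, 64])
```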

The third contribution proposes a novel unrolled, decomposed unpaired learning scheme for controllable low-light video enhancement. Obtaining paired low/normal-light videos with motion is far more challenging than obtaining still images, which raises technical issues and makes unpaired learning a critical technical route. We therefore pursue low-light video enhancement without paired ground truth. Compared to low-light image enhancement, enhancing low-light videos is more difficult due to the intertwined effects of noise, exposure, and contrast in the spatial domain, jointly with the need for temporal coherence. To address this challenge, we propose the Unrolled Decomposed Unpaired Network (UDU-Net), which unrolls the optimization functions into a deep network that decomposes the signal into spatial- and temporal-related factors, updated iteratively. First, we formulate low-light video enhancement as a Maximum A Posteriori (MAP) estimation problem with carefully designed spatial and temporal visual regularization. Then, by unrolling the problem, the optimization of the spatial and temporal constraints is decomposed into different steps and updated in a stage-wise manner. From the spatial perspective, the designed Intra subnet leverages unpaired prior information from expert photographic retouching to adjust statistical distributions. Additionally, we introduce a novel mechanism that integrates human perception feedback to guide network optimization, suppressing over-/under-exposure. Meanwhile, from the temporal perspective, the designed Inter subnet fully exploits temporal cues in progressive optimization, helping achieve improved temporal consistency in the enhancement results. Consequently, the proposed method achieves superior performance to state-of-the-art methods in video illumination, noise suppression, and temporal consistency across outdoor and indoor scenes.
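
To make the unrolling idea concrete, the sketch below alternates learned spatial (Intra) and temporal (Inter) update steps over a fixed number of stages, mirroring a MAP-style objective of the form min_Y D(Y, X) + lambda_s * R_s(Y) + lambda_t * R_t(Y). The subnet internals, kernel factorization, and stage count are placeholders for exposition, not the UDU-Net implementation.

```python
import torch
import torch.nn as nn

class UnrolledEnhancerSketch(nn.Module):
    """Stage-wise sketch of unrolled optimization for video enhancement.

    Alternates learned proximal-style updates: an Intra step for spatial
    statistics and an Inter step for temporal consistency. Both subnets
    here are simple stand-ins; the thesis's actual Intra/Inter subnets
    (and their priors and feedback mechanisms) are more elaborate.
    """
    def __init__(self, ch=3, feat=16, stages=3):
        super().__init__()
        def spatial_step():
            # Purely spatial kernels: no mixing across the frame axis.
            return nn.Sequential(
                nn.Conv3d(ch, feat, (1, 3, 3), padding=(0, 1, 1)),
                nn.ReLU(inplace=True),
                nn.Conv3d(feat, ch, (1, 3, 3), padding=(0, 1, 1)),
            )
        def temporal_step():
            # Purely temporal kernels: mixing only across neighboring frames.
            return nn.Sequential(
                nn.Conv3d(ch, feat, (3, 1, 1), padding=(1, 0, 0)),
                nn.ReLU(inplace=True),
                nn.Conv3d(feat, ch, (3, 1, 1), padding=(1, 0, 0)),
            )
        self.intra = nn.ModuleList(spatial_step() for _ in range(stages))
        self.inter = nn.ModuleList(temporal_step() for _ in range(stages))

    def forward(self, x):                         # x: (B, C, T, H, W)
        y = x                                     # initialize the estimate
        for intra_k, inter_k in zip(self.intra, self.inter):
            y = y + intra_k(y)                    # spatial (Intra) update
            y = y + inter_k(y)                    # temporal (Inter) update
        return y


x = torch.randn(1, 3, 5, 32, 32)
print(UnrolledEnhancerSketch()(x).shape)          # torch.Size([1, 3, 5, 32, 32])
```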

Therefore, this thesis studies the enhancement of visual data in practical scenarios and promotes the development of more immersive and realistic interactions. By seamlessly integrating these innovations, the proposed framework can robustly and controllably enhance the quality of visual data, providing high-quality inputs for subsequent computer vision tasks and improving the overall user experience. This work offers a new perspective on enhancing low-quality visual data, contributing to advancing computer vision and intelligent multimedia applications.
Date of Award: 19 Dec 2024
Original language: English
Awarding Institution:
  • City University of Hong Kong
Supervisor: Shiqi WANG
