Visual Quality Enhancement: Exploring Collaborative Guided Learning with Adaptive Fusion

Student thesis: Doctoral Thesis

Abstract

The visual quality of images and videos plays a major role not only in human visual perception but also in outdoor recognition tasks. However, the imaging process is prone to degradation under varying capturing conditions, because differences in viewing perspective and the limitations of capturing devices (e.g., frame rates and resolutions) lead to markedly different visual experiences. As a result, quality degradation presents significant challenges for texture preservation in images and temporal smoothing in videos. Thus, this thesis focuses on improving visual quality by exploiting collaborative learning with adaptive fusion. It consists of three parts: 1) designing intra-view feature enhancement followed by inter-view feature alignment and fusion, which benefits from similar feature correspondences across different views, for the new research problem of multi-view low-light image enhancement; 2) incorporating both long-range and short-range temporal dependence to model temporally correlated correspondences, improving temporal consistency for low-light video enhancement; and 3) integrating the inherent structural correlation between the high-resolution (HR) panchromatic (PAN) image and the multi-spectral (MS) target image with different structural experts for pan-sharpening.

In the first part, we make the first attempt to investigate multi-view low-light image enhancement. First, we construct a new dataset called Multi-View Low-light Triplets (MVLT), containing 1,860 image triplets with large illumination ranges and wide noise distributions. Each triplet captures the same scene from three different viewpoints. Second, we propose a deep multi-view enhancement framework based on the Recurrent Collaborative Network (RCNet). Specifically, in order to benefit from similar texture correspondences across different views, we design the recurrent feature enhancement, alignment and fusion (ReEAF) module, in which intra-view feature enhancement (Intra-view EN) followed by inter-view feature alignment and fusion (Inter-view AF) is performed to model intra-view and inter-view feature propagation sequentially via multi-view collaboration. In addition, two modules, from enhancement to alignment (E2A) and from alignment to enhancement (A2E), are developed to enable interactions between Intra-view EN and Inter-view AF; they explicitly utilize attentive feature weighting and sampling for enhancement and alignment, respectively. Experimental results demonstrate that our RCNet significantly outperforms other state-of-the-art methods.
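To make the ReEAF recurrence concrete, below is a minimal PyTorch-style sketch of one enhancement-alignment-fusion step. Module names mirror the abstract (Intra-view EN, Inter-view AF, E2A, A2E), but the layer choices are illustrative assumptions: a plain convolution stands in for the actual alignment design, and a sigmoid gate stands in for the attentive weighting; this is not the thesis architecture.

```python
import torch
import torch.nn as nn

class IntraViewEN(nn.Module):
    """Intra-view feature enhancement: residual refinement within one view."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class InterViewAF(nn.Module):
    """Inter-view alignment and fusion: map auxiliary views toward the
    reference view (a plain conv stands in for true alignment) and fuse."""
    def __init__(self, ch, n_views=3):
        super().__init__()
        self.align = nn.Conv2d(ch, ch, 3, padding=1)
        self.fuse = nn.Conv2d(n_views * ch, ch, 1)

    def forward(self, feats):                       # list of per-view features
        aligned = [feats[0]] + [self.align(f) for f in feats[1:]]
        return self.fuse(torch.cat(aligned, dim=1))

class ReEAF(nn.Module):
    """One recurrent enhancement-alignment-fusion step."""
    def __init__(self, ch, n_views=3):
        super().__init__()
        self.intra = IntraViewEN(ch)
        self.e2a_gate = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())  # E2A
        self.inter = InterViewAF(ch, n_views)
        self.a2e = nn.Conv2d(ch, ch, 1)                                    # A2E

    def forward(self, feats, state):
        # Intra-view EN, conditioned on the recurrent state.
        enhanced = [self.intra(f + state) for f in feats]
        # E2A: attentive feature weighting before alignment.
        gated = [e * self.e2a_gate(e) for e in enhanced]
        fused = self.inter(gated)                   # Inter-view AF
        # A2E: the fused result feeds back into the recurrent state.
        return fused, state + self.a2e(fused)

# Usage: three views' features and a zero-initialized recurrent state.
views = [torch.randn(1, 32, 64, 64) for _ in range(3)]
out, state = ReEAF(32)(views, torch.zeros(1, 32, 64, 64))
```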

In the second part, we propose the Long-short Temporal Filtering Network (TFNet) to learn the mapping from low-light videos to normal-light ones, utilizing a well-considered data-centric strategy and a refined architecture. From the data-centric perspective, we incorporate both long-range and short-range temporal dependence into TFNet, effectively capturing the temporal information. From the model design perspective, TFNet incorporates the Temporal-aware Attentional Filtering (TAF) module, which estimates and adaptively combines filtering kernels for guided filtering of the middle frame's features. To further refine the filtered features, cascaded Grouped Attention (GA) blocks are employed. Experimental results on benchmark datasets demonstrate the superiority of TFNet over state-of-the-art methods in terms of video frame quality and brightness consistency.
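The following is a minimal sketch of kernel-prediction-style guided filtering in the spirit of the TAF module, assuming a PyTorch implementation: each frame predicts a per-pixel k x k kernel that filters the middle frame's features, and a learned temporal attention weight adaptively combines the per-frame results. The specific heads and shapes are assumptions for illustration, not the published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TAF(nn.Module):
    """Temporal-aware Attentional Filtering sketch: per-frame kernel
    prediction, guided filtering of the middle frame, attentional fusion."""
    def __init__(self, ch, k=3):
        super().__init__()
        self.k = k
        self.kernel_head = nn.Conv2d(2 * ch, k * k, 3, padding=1)
        self.weight_head = nn.Conv2d(2 * ch, 1, 3, padding=1)

    def forward(self, feats):                 # feats: [B, T, C, H, W]
        B, T, C, H, W = feats.shape
        mid = feats[:, T // 2]                # middle-frame features
        # Extract k x k neighborhoods of the middle frame once.
        patches = F.unfold(mid, self.k, padding=self.k // 2)
        patches = patches.view(B, C, self.k * self.k, H, W)
        outs, logits = [], []
        for t in range(T):
            pair = torch.cat([mid, feats[:, t]], dim=1)
            kernel = F.softmax(self.kernel_head(pair), dim=1)    # [B, k*k, H, W]
            outs.append((patches * kernel.unsqueeze(1)).sum(2))  # filtered mid
            logits.append(self.weight_head(pair))
        w = F.softmax(torch.stack(logits, 1), dim=1)             # temporal attention
        return (torch.stack(outs, 1) * w).sum(1)                 # [B, C, H, W]

# Usage: five frames of 32-channel features.
y = TAF(32)(torch.randn(1, 5, 32, 64, 64))
```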

In the third part, we propose the Multi-modal Structural Mixture of Experts (MS-MoE) framework for pan-sharpening. Pan-sharpening aims to generate the HR MS target image from its low-resolution (LR) counterpart, guided by the corresponding HR PAN image with abundant textural and structural details. Specifically, given the upsampled LRMS and PAN images spatially rotated at various angles, we design a set of structural experts to extract the complementary spatial and spectral features between them, in which the Texture Enhancement Module (TEM) is introduced to extract and enhance texture-structural features from different modalities. Subsequently, we introduce an additional expert network to perform feature fusion by integrating the outputs from multiple experts. To reconstruct the high-frequency information, we further leverage the Frequency feature Refinement Module (FRM) to aggregate and refine the fused features in the frequency domain. Experimental results on benchmark pan-sharpening datasets demonstrate that the MS-MoE framework outperforms recent state-of-the-art methods.
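A minimal sketch of the expert-fusion-refinement pipeline follows, assuming PyTorch. The structural experts are reduced to plain convolutions over concatenated LRMS/PAN features, and the FRM is approximated as an FFT, a learned transform on the real and imaginary parts, and an inverse FFT applied as a residual; the rotation branches and TEM are omitted. All module shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FRM(nn.Module):
    """Frequency feature Refinement Module sketch: refine fused features in
    the frequency domain via FFT, learned 1x1 convs, and inverse FFT."""
    def __init__(self, ch):
        super().__init__()
        self.freq = nn.Sequential(
            nn.Conv2d(2 * ch, 2 * ch, 1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(2 * ch, 2 * ch, 1))

    def forward(self, x):                              # x: [B, C, H, W]
        spec = torch.fft.rfft2(x, norm="ortho")        # complex spectrum
        z = self.freq(torch.cat([spec.real, spec.imag], dim=1))
        real, imag = z.chunk(2, dim=1)
        out = torch.fft.irfft2(torch.complex(real, imag),
                               s=x.shape[-2:], norm="ortho")
        return x + out                                 # residual refinement

class MSMoE(nn.Module):
    """Structural experts over concatenated LRMS/PAN features, a fusion
    expert over their outputs, then frequency-domain refinement."""
    def __init__(self, ch, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Conv2d(2 * ch, ch, 3, padding=1) for _ in range(n_experts)])
        self.fusion_expert = nn.Conv2d(n_experts * ch, ch, 1)
        self.frm = FRM(ch)

    def forward(self, lrms_up, pan):          # per-modality feature maps
        x = torch.cat([lrms_up, pan], dim=1)
        feats = [e(x) for e in self.experts]  # complementary structural experts
        fused = self.fusion_expert(torch.cat(feats, dim=1))
        return self.frm(fused)

# Usage: 32-channel upsampled-LRMS and PAN feature maps.
hrms = MSMoE(32)(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64))
```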
 
In summary, this thesis studies learning-based visual quality enhancement techniques. The characteristics of multi-view, temporal, and multi-modal visual data are systematically and specifically studied to improve enhancement performance through collaborative learning and adaptive fusion. Comprehensive evaluations validate the effectiveness and generalization capability of the proposed methods, which will benefit various practical applications involving low-light imaging, video processing, and remote sensing.
Date of Award: 8 Aug 2025
Original language: English
Awarding Institution:
  • City University of Hong Kong
Supervisor: Shiqi WANG
