From CNN to Transformer: Deep Learning-Based Image Restoration and Enhancement
Student thesis: Doctoral Thesis
Detail(s)

Awarding Institution | City University of Hong Kong
Award date | 28 Dec 2023

Link(s)

Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(b6809e31-5329-416e-bfae-bbe7f165f4af).html
Abstract
Deep learning-based methods have achieved remarkable success due to their powerful modeling capabilities. In particular, Convolutional Neural Networks (CNNs) have dominated image processing tasks such as image restoration and image enhancement. However, the locality of convolution limits the receptive field of CNNs and makes it difficult to capture long-range dependencies. Moreover, the weight sharing of convolution layers leads to content-independent interactions between images and filters: the same kernel is applied regardless of what the image contains. The transformer architecture, originally designed for natural language processing and well suited to modeling global context, has therefore been considered an alternative to CNNs, striving for better performance with a simple, general-purpose neural architecture.
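To make the contrast concrete, the sketch below (illustrative PyTorch, not code from the thesis) shows the two interaction patterns: a convolution applies one fixed local kernel everywhere, while self-attention computes content-dependent weights over all spatial positions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)             # (batch, channels, H, W)

# CNN: one fixed 3x3 kernel shared across the image; each output pixel
# only sees a local window, regardless of content.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
y_local = conv(x)

# Self-attention: every position attends to every other position, with
# weights computed from the content (q @ k^T) rather than fixed filters.
tokens = x.flatten(2).transpose(1, 2)      # (1, H*W, 64) token view
q = k = v = tokens
attn = F.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1)  # (1, HW, HW)
y_global = attn @ v                        # each output mixes all positions
```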
In this thesis, we first summarize the current state of CNN and transformer applications in image processing. We discuss the strengths and limitations of CNNs and highlight the characteristics of transformers that make them well suited to image restoration and enhancement. We then propose a hybrid framework for end-to-end compressive image sensing, composed of adaptive image sampling and restoration, that exploits the representational capacity of both local and global features. The framework consists of concurrent CNN and transformer stems, enabling simultaneous computation of fine-grained and long-range features and efficient aggregation of the results. Moreover, we adopt a progressive strategy and a window-based transformer block to reduce the number of parameters and the computational complexity. Experimental results demonstrate the effectiveness of this dedicated transformer-based architecture for compressive sensing, which outperforms state-of-the-art methods in both qualitative and quantitative evaluations across different datasets.
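A minimal sketch of such a dual-stem design, with plain global attention standing in for the window-based block and a 1x1 convolution for aggregation; module names and sizes are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Concurrent CNN stem + transformer stem with feature aggregation."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.cnn_stem = nn.Sequential(           # fine-grained local features
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1),
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)   # aggregate both streams

    def forward(self, x):                        # x: (B, C, H, W)
        local = self.cnn_stem(x)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)         # (B, H*W, C) token view
        glob, _ = self.attn(t, t, t)             # long-range dependencies
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([local, glob], dim=1))
```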
In the second part, we employ a transformer block that computes self-attention across channels rather than the spatial dimension, so that the complexity grows linearly rather than quadratically with the number of pixels and remains tractable for high-resolution images. Taking this transformer as our baseline network, we introduce external memory to form an external memory-augmented network for low-light image enhancement. Benefiting from the learned memory, the network can "remember" complex distributions of reference images across the entire dataset, allowing test samples to be adjusted more adaptively. Notably, the proposed external memory is a plug-and-play mechanism that can be integrated into any existing method to further improve enhancement quality. Both quantitative and qualitative results show that the proposed model effectively improves the quality of the enhanced images.
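The sketch below illustrates both ingredients under stated assumptions: a channel-wise ("transposed") attention whose C x C attention map keeps the cost linear in the number of pixels, and a learned external memory that features read from by attention. Slot count, dimensions, and the residual read rule are illustrative choices, not the thesis design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Self-attention across channels: attention map is C x C, not HW x HW."""
    def __init__(self, dim=64):
        super().__init__()
        self.qkv = nn.Conv2d(dim, 3 * dim, 1)
        self.out = nn.Conv2d(dim, dim, 1)
        self.temp = nn.Parameter(torch.ones(1))  # learned temperature

    def forward(self, x):                         # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        q = F.normalize(q.flatten(2), dim=-1)     # (B, C, HW)
        k = F.normalize(k.flatten(2), dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temp   # (B, C, C)
        out = attn.softmax(dim=-1) @ v.flatten(2)      # (B, C, HW)
        return self.out(out.reshape(b, c, h, w))

class ExternalMemory(nn.Module):
    """Learned memory slots shared across the dataset; plug-and-play read."""
    def __init__(self, dim=64, slots=256):
        super().__init__()
        self.mem = nn.Parameter(torch.randn(slots, dim))

    def forward(self, x):                         # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)          # (B, HW, C)
        attn = (t @ self.mem.t()).softmax(dim=-1) # (B, HW, slots)
        read = attn @ self.mem                    # (B, HW, C)
        return x + read.transpose(1, 2).reshape(b, c, h, w)
```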
In the third part, we incorporate a high-quality code bank as a prior within the transformer to guide low-light image enhancement. Specifically, we first pre-train a VQGAN on extensive high-quality image datasets to capture a prior for normal-light conditions. This prior is stored in a discrete codebook and its corresponding decoded feature space, which together form the code bank that guides the enhancement process. To align low-light features with the undistorted normal-light code bank features, we introduce a code bank-guided block in our enhancement network, integrated into the transformer to leverage the prior information. Compared with state-of-the-art methods, quantitative and qualitative results on paired and unpaired datasets under various evaluation metrics show the superiority of our method.
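A minimal sketch of the codebook lookup underlying a VQGAN-style prior: each feature vector is replaced by its nearest entry in a discrete codebook learned from high-quality images. The codebook size and the L2 nearest-neighbor rule are standard vector-quantization choices assumed for illustration, not details taken from the thesis.

```python
import torch
import torch.nn as nn

class CodeBank(nn.Module):
    """Discrete codebook of high-quality codes; quantizes encoder features."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # learned on HQ images

    def forward(self, z):                     # z: (B, HW, dim) features
        # Squared L2 distance from every feature to every code.
        d = (z.pow(2).sum(-1, keepdim=True)
             - 2 * z @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(-1))
        idx = d.argmin(dim=-1)                # nearest-code indices (B, HW)
        return self.codebook(idx)             # quantized features (B, HW, dim)
```

In a guided enhancement setting, a block of this kind supplies the undistorted target features to which the low-light features are aligned.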