Learning-based Video Compression and Compressed Visual Data Enhancement
Student thesis: Doctoral Thesis
Detail(s)

Awarding Institution | City University of Hong Kong
Award date | 20 Dec 2023

Link(s)

Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(20a902ba-9141-42f7-9e32-74a5d055cb37).html
Abstract
Over the past few decades, a series of image and video coding standards, such as JPEG, HEVC, AVS, and VVC, have delivered significant improvements in compression efficiency. Recently, learning-based image and video compression has also shown prominent improvement, propelling further exploration of data-driven compression schemes. Moreover, lossy compression inevitably degrades the quality of the compressed visual data, which can in turn hurt downstream computer vision tasks. Compressed visual data enhancement is therefore of great importance to various applications. This thesis focuses on learning-based video compression and compressed visual data enhancement. It mainly consists of four parts: 1) multiple-hypothesis motion compensation for learning-based video compression; 2) learning-based video compression with online optimal compact representation and hierarchical rate-distortion optimization; 3) content-aware quality enhancement based on online learning and model compression; 4) a novel transformer-based raindrop removal method for compressed images.
In the first part, we develop multiple-hypothesis motion compensation for learning-based video compression, aiming to enhance motion compensation efficiency by providing diverse hypotheses with efficient temporal information fusion. A multiple hypotheses module is proposed to infer various hypotheses from the reference frame, and a hypotheses attention module is adopted to exploit the multiple hypotheses more fully. Extensive experiments show that the proposed method significantly improves the rate-distortion performance of learning-based video compression.
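To make the idea concrete, the following is a minimal PyTorch sketch of multi-hypothesis motion compensation with attention-based fusion: the reference frame is warped by several decoded motion fields, and per-pixel attention weights fuse the resulting hypotheses. The warping scheme, the number of hypotheses, and the attention layout are illustrative assumptions, not the thesis's exact modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (N,C,H,W) by a dense flow field (N,2,H,W)."""
    n, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates, channel 0 = x, channel 1 = y.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2,H,W)
    coords = grid.unsqueeze(0) + flow                              # (N,2,H,W)
    # Normalize to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)          # (N,H,W,2)
    return F.grid_sample(frame, grid_norm, align_corners=True)

class HypothesesAttentionFusion(nn.Module):
    """Warp the reference with K motion fields, then fuse the K hypotheses
    with per-pixel attention weights predicted from the hypotheses."""
    def __init__(self, channels=3, num_hyp=4):
        super().__init__()
        self.attn = nn.Conv2d(channels * num_hyp, num_hyp, 3, padding=1)

    def forward(self, reference, flows):
        # flows: list of K motion fields (N,2,H,W) decoded from the bitstream.
        hyps = [warp(reference, f) for f in flows]           # K hypotheses
        weights = torch.softmax(self.attn(torch.cat(hyps, 1)), dim=1)
        stacked = torch.stack(hyps, dim=1)                   # (N,K,C,H,W)
        return (weights.unsqueeze(2) * stacked).sum(dim=1)   # fused prediction
```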
In the second part, we propose an online learning scheme with optimal compact representation and hierarchical rate-distortion optimization for learning-based video compression. Specifically, online refinement of the latent representation of each frame is introduced to obtain the optimal compact representation. Moreover, hierarchical rate-distortion optimization is proposed to pursue higher compression efficiency, where hierarchical weights are assigned to successive frames. Experimental results demonstrate that the proposed scheme achieves considerable coding gains and outperforms state-of-the-art learning-based video compression approaches.
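As an illustration of encoder-side latent refinement under a rate-distortion objective, here is a minimal PyTorch sketch. The names `decoder` and `rate_model` (which must return a scalar bit estimate), the optimizer settings, and the hierarchical per-frame weights are placeholders for illustration, not the thesis's actual implementation.

```python
import torch

def refine_latent(y0, decoder, rate_model, target, lam, steps=100, lr=1e-2):
    """Refine one frame's latent toward a lower rate-distortion cost at
    encoding time; all network weights stay frozen, only the latent moves."""
    y = y0.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([y], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        dist = torch.mean((decoder(y) - target) ** 2)  # distortion D
        rate = rate_model(y)                           # estimated bits R
        (lam * dist + rate).backward()                 # cost = lambda * D + R
        opt.step()
    return y.detach()

# Hierarchical R-D optimization (illustrative): frames at different positions
# in the coding hierarchy receive different distortion weights.
# hierarchy = [1.0, 0.5, 0.25, 0.5]
# y_k = refine_latent(y0, decoder, rate_model, frame_k, lam=base_lam * hierarchy[k])
```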
In the third part, we present a novel quality enhancement method based on online deep learning with model compression to remove the artifacts induced by video compression. In particular, a deep model that absorbs and captures the characteristics of the input signal is produced and updated with an online learning strategy. To represent the to-be-signaled model information efficiently, deep model compression is adopted by introducing the conv quantization block (CQB) and rate constraints on the residue. Moreover, rate-utility optimization is employed to select the optimal model and guarantee overall performance. Extensive experiments show that our method significantly improves the quality of compressed video.
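The abstract does not detail the CQB design, so the following is only a generic PyTorch sketch of the underlying idea: quantizing the weight residue between an online-adapted model and a shared base model, so that only integer residues need to be entropy-coded and signaled. The function names and the uniform quantization step are assumptions.

```python
import copy
import torch

def quantize_residues(base_model, updated_model, step=1e-3):
    """Quantize the weight update (residue) of an online-adapted model so it
    can be signaled compactly alongside the compressed video."""
    residues = {}
    for (name, p_base), (_, p_new) in zip(
            base_model.named_parameters(), updated_model.named_parameters()):
        q = torch.round((p_new.detach() - p_base.detach()) / step)
        residues[name] = q.to(torch.int32)       # integers to entropy-code
    return residues

def apply_residues(base_model, residues, step=1e-3):
    """Decoder side: reconstruct the adapted model from base + residue."""
    model = copy.deepcopy(base_model)
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.add_(residues[name].float() * step)
    return model
```

A rate-utility selection step could then compare candidate models by the quality gain they deliver against the bits their residues cost, keeping the best trade-off.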
In the fourth part, we propose a novel transformer architecture that leverages the advantages of the attention mechanism and a high-frequency-friendly design to effectively restore compressed raindrop images at the framework, component, and module levels. Specifically, at the framework level, we integrate relative-position multi-head self-attention and convolutional layers into the proposed low-high-frequency transformer (LHFT). At the component level, we utilize high-frequency depth-wise convolution (HFDC) with zero-mean kernels to improve the extraction of high-frequency features. Finally, at the module level, we introduce a low-high attention module (LHAM) to adaptively allocate importance between low and high frequencies for effective fusion. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods.
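To illustrate the zero-mean-kernel idea and the low/high fusion, here is a minimal PyTorch sketch. A kernel whose taps sum to zero rejects the DC (low-frequency) component, so re-centering each depth-wise kernel biases the branch toward high-frequency structure. The exact HFDC and LHAM designs in the thesis may differ; the gating layout below is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighFreqDepthwiseConv(nn.Module):
    """Depth-wise convolution whose kernels are re-centered to zero mean at
    every forward pass, so each filter responds mainly to high frequencies."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(channels, 1, kernel_size, kernel_size) * 0.1)
        self.channels = channels
        self.pad = kernel_size // 2

    def forward(self, x):
        # Subtract each kernel's mean so its taps sum to zero (DC rejection).
        w = self.weight - self.weight.mean(dim=(2, 3), keepdim=True)
        return F.conv2d(x, w, padding=self.pad, groups=self.channels)

class LowHighAttention(nn.Module):
    """Fuse low- and high-frequency branches with learned per-channel gates."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid())

    def forward(self, low, high):
        a = self.gate(torch.cat((low, high), dim=1))  # (N,C,1,1) in [0,1]
        return a * low + (1 - a) * high
```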
In summary, this thesis studies learning-based video compression and the enhancement of compressed visual data. Deep learning techniques are studied systematically and specifically to promote the performance of learning-based video compression and compressed visual data enhancement. Extensive evaluations demonstrate the effectiveness of the proposed methods, which will benefit various practical applications that require efficient transmission, storage, and high-quality visual data.