Deep Learning-based Video Compression by Leveraging Spatial-Temporal Correlations

Student thesis: Doctoral Thesis

Abstract

Given the remarkable advancements in deep learning, learning-based compression has garnered significant interest and achieved substantial improvements. This thesis focuses on deep learning-based video compression by leveraging spatial-temporal correlations. It is structured into five key parts: 1) enhanced motion compensation to reduce error propagation in deep video compression; 2) improved context mining and filtering to increase the compression efficiency of deep contextual video compression; 3) the extension of invertible encoding to temporal information, along with the introduction of an encoding-decoding network for deep video compression; 4) sparse-to-dense entropy modelling for bidirectional deep video compression; and 5) a model-based deep video compression method that enables switching between intra coding and inter coding.

In the first part, we propose an enhanced motion compensation method to reduce error propagation in deep video compression. Specifically, we incorporate the designed convolutional neural network into OpenDVC as a motion compensation enhancement network that removes noise from the predicted frame. With the enhanced frame, we jointly optimize the whole framework with a single loss function that balances the trade-off between bit cost and frame quality. Experiments show that the proposed enhanced motion compensation model reduces error propagation within a group of frames.
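For illustration only, a minimal PyTorch sketch of this idea follows: a small residual CNN (a stand-in for the enhancement network, with illustrative layer widths) denoises the motion-compensated frame, and a single rate-distortion objective of the form R + λD drives the joint optimization. The class name, the rd_loss helper, and the λ value are assumptions, not the thesis's exact design.

    import torch
    import torch.nn as nn

    class MotionCompEnhancer(nn.Module):
        # Residual CNN that denoises the motion-compensated (predicted) frame.
        # Layer widths are illustrative, not the thesis's exact design.
        def __init__(self, ch=64):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, 3, 3, padding=1))

        def forward(self, predicted):
            # Residual learning: predict a correction and add it back.
            return predicted + self.body(predicted)

    def rd_loss(frame, recon, bits, lam=256.0):
        # Single objective trading off bit cost (R) and frame quality (D):
        # J = R + lambda * D, with MSE as the distortion term.
        return bits + lam * torch.mean((frame - recon) ** 2)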

In the second part, we propose enhanced context mining and filtering to improve the compression efficiency of deep contextual video compression (DCVC). First, considering that the context in DCVC is generated without supervision and that redundancy may exist among context channels, we propose an enhanced context mining model that mitigates this redundancy to obtain superior context features. Then, we introduce a transformer-based enhancement network as a filtering module to capture long-distance dependencies and further improve compression efficiency. The transformer-based enhancement network adopts a full-resolution pipeline and computes self-attention across the channel dimension. By combining the local modelling ability of the enhanced context mining model with the nonlocal modelling ability of the transformer-based enhancement network, our model outperforms the low-delay P (LDP) configuration of Versatile Video Coding (VVC).
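For intuition, a channel-wise self-attention layer in the spirit described above might look as follows in PyTorch. Because attention is computed over channels rather than spatial positions, the attention matrix is channels-by-channels and its cost does not grow quadratically with resolution, which is what makes a full-resolution pipeline feasible. The class name and dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ChannelSelfAttention(nn.Module):
        # Self-attention across the channel dimension at full resolution.
        def __init__(self, dim=64, heads=4):
            super().__init__()
            self.heads = heads
            self.qkv = nn.Conv2d(dim, dim * 3, 1)
            self.proj = nn.Conv2d(dim, dim, 1)
            self.scale = nn.Parameter(torch.ones(heads, 1, 1))

        def forward(self, x):
            b, c, h, w = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=1)
            # Flatten spatial positions; the attention matrix is
            # (c/heads x c/heads), independent of resolution.
            q = q.reshape(b, self.heads, c // self.heads, h * w)
            k = k.reshape(b, self.heads, c // self.heads, h * w)
            v = v.reshape(b, self.heads, c // self.heads, h * w)
            q = F.normalize(q, dim=-1)
            k = F.normalize(k, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale  # channel affinity
            attn = attn.softmax(dim=-1)
            out = (attn @ v).reshape(b, c, h, w)
            return x + self.proj(out)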

In the third part, considering that autoencoder-style encoding-decoding networks can lose information during encoding that cannot be recovered during decoding, we propose a new approach that extends invertible encoding to temporal information and introduces an encoding-decoding network for deep video compression. Our network incorporates a novel attentive channel squeeze module to improve compression performance, and it further leverages a conditional coding framework for motion compression. Experimental results demonstrate the effectiveness of our approach.
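To make the two ingredients concrete, here is a schematic PyTorch sketch of (a) an additive coupling layer, the standard building block of invertible encoding, and (b) one plausible reading of the attentive channel squeeze, which reweights channels before reducing their number. Both are hedged illustrations rather than the thesis's exact modules.

    import torch
    import torch.nn as nn

    class AdditiveCoupling(nn.Module):
        # Invertible coupling layer: transforms one half of the channels
        # conditioned on the other half; forward() is exactly undone by
        # inverse(), so encoding itself loses no information.
        def __init__(self, ch=64):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(ch // 2, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, ch // 2, 3, padding=1))

        def forward(self, x):
            x1, x2 = x.chunk(2, dim=1)
            return torch.cat([x1, x2 + self.f(x1)], dim=1)

        def inverse(self, y):
            y1, y2 = y.chunk(2, dim=1)
            return torch.cat([y1, y2 - self.f(y1)], dim=1)

    class AttentiveChannelSqueeze(nn.Module):
        # Hypothetical reading: gate each latent channel with learned
        # attention before squeezing to fewer channels, so the lossy step
        # keeps the most informative content.
        def __init__(self, in_ch=64, out_ch=32):
            super().__init__()
            self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                      nn.Conv2d(in_ch, in_ch, 1), nn.Sigmoid())
            self.squeeze = nn.Conv2d(in_ch, out_ch, 1)

        def forward(self, y):
            return self.squeeze(y * self.gate(y))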

In the fourth part, we introduce a learning-based bidirectional video compression method with sparse-to-dense entropy modelling. We employ two reference frames to produce a single reference frame for conditional compression. Specifically, we leverage a Siamese neural network to obtain feature-similarity values from the two reference frames; these values, together with the related frames, are used to generate a single reference frame for conditional compression. Furthermore, most existing deep video codecs model the entropy with spatial and temporal correlations but do not exploit channel characteristics: channels are treated equally in current deep video compression entropy designs. However, entropy is not evenly distributed among channels, and the beginning channels carry denser entropy. We therefore design sparse-to-dense entropy modelling that takes this distribution into account, predicting the beginning channels with the help of the deeper channels' parameters. With our design, the deeper channels with sparser entropy are modelled sequentially, and the beginning channels with denser entropy have more inputs as references. Experimental results demonstrate the correctness and effectiveness of these designs. Our proposed method outperforms the VVC random access (RA) configuration, achieving an average of 31.79% bit savings in terms of MS-SSIM.
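A hedged PyTorch sketch of the sparse-to-dense ordering follows: the latent channels are split into groups and modelled from the last (sparse-entropy) group back to the first (dense-entropy) group, so the beginning channels condition on the most decoded context. The group count, layer sizes, and the zero-context placeholder for the deepest group are illustrative assumptions; a hyperprior would supply that context in practice.

    import torch
    import torch.nn as nn

    class SparseToDenseEntropy(nn.Module):
        # Channel-group entropy model, decoded from sparse to dense groups.
        def __init__(self, channels=128, groups=4):
            super().__init__()
            self.groups = groups
            gc = channels // groups
            # Net i maps the already-decoded deeper groups to the (mean,
            # scale) parameters of group i.
            self.param_nets = nn.ModuleList(
                nn.Conv2d(max(1, groups - 1 - i) * gc, 2 * gc, 1)
                for i in range(groups))

        def forward(self, y):
            chunks = y.chunk(self.groups, dim=1)
            params = [None] * self.groups
            # Iterate from the deepest group (sparsest entropy) to the first.
            for i in reversed(range(self.groups)):
                if i == self.groups - 1:
                    ctx = torch.zeros_like(chunks[i])  # placeholder context
                else:
                    ctx = torch.cat(chunks[i + 1:], dim=1)
                mean, scale = self.param_nets[i](ctx).chunk(2, dim=1)
                params[i] = (mean, scale)
            return params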

In the fifth part, we investigate the concept of model-based deep learning in video compression. Specifically, we explore the integration of knowledge-based techniques with deep learning for video compression. Unlike existing deep video compression methods, which first compute inter-frame motion and then use the reconstructed motion for frame compression with predictive or conditional coding, our approach focuses on learning intra-coding and inter-coding neural networks. We propose a model-based deep video compression method that switches between intra coding and inter coding by computing the rate-distortion cost, as in traditional video compression. Compared with previous end-to-end deep video compression techniques, our approach achieves a shorter decoding time because of the dynamic switching between inter and intra modes within our scheme.
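The mode decision itself can be sketched in a few lines. In the following hedged illustration, intra_codec and inter_codec are hypothetical callables that each return a reconstruction and an estimated bit cost, and λ plays the same role as the Lagrange multiplier in traditional rate-distortion optimization; none of these names come from the thesis.

    import torch

    def choose_mode(frame, intra_codec, inter_codec, ref, lam=256.0):
        # Per-frame mode decision by rate-distortion cost J = R + lambda * D,
        # mirroring the intra/inter switching of traditional codecs.
        recon_i, bits_i = intra_codec(frame)
        recon_p, bits_p = inter_codec(frame, ref)
        cost_i = bits_i + lam * torch.mean((frame - recon_i) ** 2)
        cost_p = bits_p + lam * torch.mean((frame - recon_p) ** 2)
        if cost_p <= cost_i:
            return "inter", recon_p
        return "intra", recon_i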

In summary, this thesis systematically investigates deep learning techniques to improve the performance of learning-based video compression. Comprehensive evaluations demonstrate the effectiveness of the proposed methods, which will benefit a range of practical applications that require efficient transmission and storage of high-quality visual data.
Date of Award: 24 Apr 2025
Original language: English
Awarding Institution:
  • City University of Hong Kong
Supervisors: Shiqi WANG (Supervisor) & Tak Wu Sam KWONG (External Co-Supervisor)
