Intelligent Video Coding and Processing Techniques for Enhanced Virtual Environments

增強虛擬環境下的智能視頻編碼與處理技術

Student thesis: Doctoral Thesis

Award date: 29 Jul 2024

Abstract

Recent years have witnessed strong demand for virtual avatars and virtual environments in enhanced online communication. At the same time, the widespread success of deep neural networks across application domains has made the machine the ultimate consumer of data in virtual environments. These application scenarios, enhanced virtual environments with machine vision receivers, present new challenges for compression and processing methods. This thesis therefore focuses on visual data compression for virtual environments and machine-centric applications, fully exploiting intelligent video coding and processing techniques. It consists of four parts: 1) quality harmonization for virtual composition in online video communications; 2) semantic face compression for the metaverse: a compact 3D descriptor-based approach; 3) deep pre-editing and post-processing for image compression towards machine vision; and 4) variable bitrate image compression for large visual-language models. The first part shows promising performance and sheds light on the development of future quality harmonization schemes in a variety of applications. The second is expected to enable numerous applications, such as digital human communication based on machine analysis, and to form a cornerstone of interaction and communication in the metaverse. The third is expected to inform the development of new image processing algorithms for machine vision. The fourth is expected to provide insight into constructing new image coding for large visual-language models with multi-modal data.

In the first part, we identify the quality harmonization problem of virtual composition in online video communication scenarios and propose a quality harmonization framework to address it, with the underlying principle that harmonization can be achieved by adaptively regulating the quality of the background. In particular, the quality of the background can be aligned with that of the foreground through compression, applying the optimal quantization parameter (QP) derived from the distorted foreground. In light of this, we propose a quality assessment model that infers the quality of the foreground without any reference, based upon which the compression parameter is derived from the modeling between distortion and QP. The two models are learned in a self-supervised manner and yield high-accuracy estimates.
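The background-foreground alignment described above can be illustrated with a minimal sketch of the distortion-QP modeling: a no-reference estimate of the foreground distortion is inverted through a fitted distortion-QP curve to obtain the background QP. The exponential model and its coefficients below are illustrative placeholders, not the learned models from the thesis.

```python
import math

# Hypothetical fitted model D(QP) = A * exp(B * QP); A and B are assumed
# placeholder coefficients, not values from the thesis.
A, B = 0.5, 0.12

def distortion_from_qp(qp):
    """Predict distortion for a given quantization parameter."""
    return A * math.exp(B * qp)

def qp_from_distortion(d):
    """Invert the model and clip to the valid HEVC QP range [0, 51]."""
    qp = math.log(d / A) / B
    return max(0, min(51, round(qp)))

# Align background quality with an (assumed) estimated foreground distortion:
fg_distortion = distortion_from_qp(32)   # stand-in for the NR quality model
bg_qp = qp_from_distortion(fg_distortion)
```

Encoding the background with `bg_qp` would then match its distortion level to that of the compressed foreground, which is the harmonization principle at work.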

In the second part, we envision a new metaverse communication paradigm for virtual avatar faces and develop semantic face compression with compact 3D facial descriptors. The paradigm comprises a compression framework that transmits 3D face descriptors for semantic communication, together with applications built upon those descriptors. The fundamental principle is that the communication of virtual avatar faces primarily concerns the conveyance of semantic information. We then show how the descriptors representing the 3D face can be feasibly transmitted with an end-to-end compression network. Finally, we demonstrate the utility of such a framework for intelligent applications, aiming at a better understanding of the face without reconstructing the signal.
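As a toy illustration of how a compact 3D facial descriptor could be carried as a small payload: the thesis uses an end-to-end learned compression network, whereas the uniform scalar quantization and byte packing below are only an assumed stand-in to show the compactness of descriptor-based transmission.

```python
import struct

def encode_descriptor(coeffs, step=0.01):
    """Uniformly quantize descriptor coefficients and pack them as int16.

    `step` is an assumed quantization step, not a value from the thesis.
    """
    indices = [round(c / step) for c in coeffs]
    return struct.pack(f"<{len(indices)}h", *indices)

def decode_descriptor(payload, step=0.01):
    """Unpack int16 indices and dequantize back to coefficients."""
    indices = struct.unpack(f"<{len(payload) // 2}h", payload)
    return [i * step for i in indices]

# A hypothetical 3D face descriptor (e.g., pose/expression coefficients).
desc = [0.12, -0.34, 0.05, 0.90]
payload = encode_descriptor(desc)      # 2 bytes per coefficient
recon = decode_descriptor(payload)
```

Even this naive scheme needs only a few bytes per face, orders of magnitude below pixel-domain coding, which is the intuition behind semantic descriptor transmission.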

In the third part, we propose an image compression framework for machine vision based upon end-to-end optimized pre-editing and post-processing modules that operate at both the global and local levels. The proposed framework focuses on the critical semantic information to achieve better rate-accuracy performance. In particular, we propose a compression ratio-guided rescaling parameter estimation network in the pre-editing module to optimize the spatial resolution of images. Moreover, we develop a semantic attention-based model, serving as the pre-processing and post-enhancement networks, to perform pixel-level image processing.
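The pre-editing idea of adapting spatial resolution to the compression ratio can be sketched as follows. The piecewise mapping stands in for the learned compression ratio-guided rescaling parameter estimation network; the bitrate thresholds and scale factors are assumed for illustration, not taken from the thesis.

```python
def estimate_rescale_factor(target_bpp):
    """Map a target bitrate (bits per pixel) to a downscaling factor.

    Hand-crafted stand-in for the learned estimation network: at low
    bitrates, coding a downscaled image tends to preserve semantics
    better than coding the full resolution coarsely.
    """
    for threshold, scale in [(0.50, 1.0), (0.25, 0.75), (0.10, 0.5)]:
        if target_bpp >= threshold:
            return scale
    return 0.25

def rescaled_size(width, height, target_bpp):
    """Spatial resolution handed to the codec after pre-editing."""
    s = estimate_rescale_factor(target_bpp)
    return round(width * s), round(height * s)
```

For example, at an aggressive budget of 0.12 bpp a 1920x1080 input would be coded at 960x540 and upscaled again in the post-enhancement stage.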

In the fourth part, we propose a variable bitrate image compression framework consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance across different large visual-language models (LVLMs). In particular, we propose a compression ratio-adaptive pre-editing network that processes the input image at the pixel level based on semantic information. Moreover, a variable bitrate end-to-end image codec is developed to compress the pre-edited images. The pre-editing module and the codec are jointly trained with losses defined on the semantic tokens of the large model, which enhances generalization across diverse data and tasks.
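The joint training objective of this part combines a rate term with a semantic fidelity term measured on the large model's tokens. A minimal numeric sketch: the mean-squared token distance and the lambda weight below are assumed stand-ins for the actual token-based losses used in the thesis.

```python
def semantic_token_loss(tokens_ref, tokens_coded):
    """Mean squared distance between LVLM token embeddings (lists of floats).

    Stand-in for the semantic-token losses of the thesis; a real system
    would compare high-dimensional token embeddings from the LVLM.
    """
    assert len(tokens_ref) == len(tokens_coded)
    diffs = ((a - b) ** 2 for a, b in zip(tokens_ref, tokens_coded))
    return sum(diffs) / len(tokens_ref)

def joint_loss(rate_bpp, tokens_ref, tokens_coded, lam=1.0):
    """Rate-distortion objective: rate + lam * semantic distortion.

    `lam` trades bitrate against semantic fidelity; its value here is an
    assumed placeholder.
    """
    return rate_bpp + lam * semantic_token_loss(tokens_ref, tokens_coded)
```

Minimizing this objective jointly over the pre-editing network and the codec is what steers the bits toward the content the large model actually attends to.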

In summary, this thesis studies learning-based visual data compression and processing techniques. The characteristics of enhanced virtual environments are systematically studied to improve compression performance for both human and machine vision. Comprehensive evaluations validate the effectiveness and generalization capability of the proposed methods, which will benefit practical applications involving virtual environments, 3D vision, multi-modal visual data, and machine vision tasks.