Abstract
For a long time, visual data has played a critical role in various fields, and its importance continues to grow as imaging and display technologies advance. With the evolution of cameras and display devices, the resolution and quality of visual data have increased significantly. Meanwhile, the recent development of large vision-language models (LVLMs) and the emergence of AI-generated visual data have further expanded the scope of visual content. However, the sheer volume of such data necessitates compression to facilitate storage and transmission. Compression inevitably introduces noise and artifacts, leading to several challenges: it diminishes the viewing experience at the receiver's end by degrading visual quality, hampers the perception of authenticity in AI-generated content (AIGC), and influences the semantic understanding of visual data by LVLMs. These challenges underline the need for deeper investigation into compression techniques tailored for AIGC and LVLM-oriented applications, as well as strategies to mitigate the adverse effects of compression on both human perception and multimodal models. Thus, drawing on perspectives from both humans and LVLMs, this study delves into the enhancement and analysis of compressed data, further advancing the application and development of multimodal technologies in intelligent vision. It consists of three main parts: 1) exploiting bidirectional quality impulses for gaming video coding with Reference Picture Resampling (RPR); 2) examining the role of compression in influencing AI-generated image authenticity; 3) exploring the ability of large vision-language models to handle images with compression degradation. The first part leverages the flexibility of RPR by developing a periodic quality impulse structure and a bidirectional guidance model, effectively reducing encoder-side data volume and enhancing gaming video quality. The second part systematically examines how compression artifacts affect the subjective authenticity of AI-generated images.
The last part evaluates the visual understanding performance of GPT-4o at different compression levels, analyzing how image compression affects its semantic analysis capabilities and offering important insights into its resilience under real-world scenarios.
In the first part, we explore the enhancement of compressed gaming video quality within the VVC RPR framework by leveraging the concept of periodic quality impulses. A novel RPR structure is introduced, in which high-resolution frames are strategically placed to provide strong reference cues for the restoration of adjacent low-resolution frames. Built upon this structure, a bidirectional enhancement model is proposed to exploit temporal correlations from both directions, improving the reconstruction quality of downsampled frames. To facilitate model training and evaluation, we construct a dedicated GamingQE dataset covering diverse gaming genres and compression levels. Experimental results demonstrate that the proposed approach yields substantial bitrate reductions and superior perceptual quality compared to existing methods, highlighting its potential in latency-tolerant game streaming applications.
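The periodic structure described above can be illustrated with a minimal sketch (not the thesis implementation): every Nth frame is coded at high resolution as a quality impulse, and each intermediate low-resolution frame draws on its nearest preceding and following impulses as bidirectional references. The period value and function names here are illustrative assumptions.

```python
# Illustrative sketch of a periodic quality impulse structure:
# every `period`-th frame is high-resolution; the rest are
# low-resolution frames enhanced from bidirectional references.

def impulse_positions(num_frames: int, period: int) -> list[int]:
    """Indices of high-resolution (impulse) frames."""
    return [i for i in range(num_frames) if i % period == 0]

def bidirectional_refs(frame_idx: int, num_frames: int, period: int):
    """Nearest preceding and following high-resolution references
    for a low-resolution frame; the forward reference is None when
    no impulse follows before the sequence ends."""
    if frame_idx % period == 0:
        return None  # impulse frames are not enhanced
    prev_ref = (frame_idx // period) * period
    next_candidate = prev_ref + period
    next_ref = next_candidate if next_candidate < num_frames else None
    return prev_ref, next_ref
```

Because the forward reference requires a future high-resolution frame, this structure introduces buffering delay, which is why the abstract targets latency-tolerant streaming scenarios.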
In the second part of this study, we investigate how compression distortion affects the subjective perception of authenticity in AI-generated images (AIGIs). To facilitate this, we build the first AIGI dataset for subjective authenticity evaluation, containing 500 AIGIs and 500 natural images at 768$\times$768 resolution. The images are categorized into five major classes and twenty subcategories to cover diverse content. Four compression levels (QP = 22, 32, 42, 52) are applied using the standard VVC codec. Results show that heavier compression leads to lower human accuracy in identifying AIGIs. We also train a model on the collected data to predict the probability that an image will be perceived as an AIGI. This work offers insights for future research on balancing authenticity and visual quality under compression.
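The finding that heavier compression lowers identification accuracy rests on aggregating subject judgments per QP level. A minimal sketch of that aggregation follows; the record field names (`qp`, `is_aigi`, `judged_aigi`) are hypothetical and stand in for whatever schema the subjective study uses.

```python
# Hypothetical per-QP accuracy aggregation for a subjective
# AIGI-identification study. Each response records the compression
# level, the ground truth, and the subject's judgment.

from collections import defaultdict

def accuracy_by_qp(responses):
    """Map each QP level to the fraction of correct judgments."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in responses:
        total[r["qp"]] += 1
        correct[r["qp"]] += int(r["is_aigi"] == r["judged_aigi"])
    return {qp: correct[qp] / total[qp] for qp in total}
```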
In the third part, we investigate how compression distortion affects the image understanding capabilities of LVLMs. To support this, we construct a new image-text dataset named GPT-COMP, containing 80,000 natural scene images, including 20,000 raw images from two public datasets. Each image is compressed at three levels (QP = 32, 42, 52) using the latest VVC Test Model. These images are processed by GPT-4o, and its text responses are included in GPT-COMP for evaluation. Scene understanding under different compression levels is assessed based on extracted semantics and a proposed self-scoring strategy. This study offers a basis for evaluating and enhancing the robustness of LVLMs under compression.
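The thesis evaluates understanding via extracted semantics and a GPT-4o self-scoring strategy; as an illustrative stand-in only, a simple token-overlap (Jaccard) measure can compare the model's response on a compressed image against its response on the raw image, so that declining overlap at higher QP signals degraded semantic consistency.

```python
# Illustrative proxy (not the thesis's self-scoring strategy):
# Jaccard overlap between the token sets of two text responses,
# e.g. GPT-4o's answer on a raw image vs. a compressed one.

def token_jaccard(reference: str, candidate: str) -> float:
    """1.0 means identical token sets; 0.0 means no shared tokens."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    if not ref and not cand:
        return 1.0
    return len(ref & cand) / len(ref | cand)
```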
Therefore, this thesis integrates conventional compression optimization with cutting-edge AI-driven analysis, contributing to the application of next-generation coding standards and the efficient compression, storage, and transmission of high-quality visual data. Specifically, by examining how RPR-based video coding, compression-induced authenticity degradation in AI-generated images, and LVLM-driven semantic analysis interact under various compression scenarios, this thesis provides a comprehensive framework to address challenges in multimedia systems. The proposed methods are objectively evaluated through extensive experiments, offering valuable insights for industrial applications and driving down costs associated with video data processing.
| Date of Award | 24 Apr 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Shiqi WANG (Supervisor) |