Abstract
With the exponential growth of image and video data, the massive volume of visual content presents significant challenges for storage and transmission, and conveying semantic information efficiently at low bitrates has become crucial. To address this, compression paradigms are evolving towards semantic-driven visual coding, leveraging advances in multimodal learning and generation. This thesis explores three paradigms for efficient semantic-driven coding: (1) fine granularity coding with multimodal learning, which incorporates semantic information into image compression to improve both perceptual reconstruction and accuracy in downstream analysis; (2) Scalable Cross-Modality Compression (SCMC), which represents images through hierarchical multimodal layers to serve both semantic communication and perceptual reconstruction; and (3) Cross-Modality Video Coding (CMVC), which leverages Multimodal Large Language Models (MLLMs) for efficient video compression, achieving high-quality semantic reconstruction and perceptual consistency through novel encoding-decoding modes and frame interpolation techniques.
In the first part, we propose a fine granularity coding approach that enhances semantic representation in image compression. This method incorporates a conditional vector quantized diffusion model, embedding cross-modal semantic information within the compression pipeline. An anchor-based sampling strategy prioritizes semantically significant feature vectors, balancing representation cost against fidelity. The proposed method achieves semantically rich, visually faithful, and interactively scalable decoding; extensive experiments demonstrate competitive performance in both downstream analysis accuracy and perceptual reconstruction quality.
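The anchor-based sampling idea can be illustrated with a minimal sketch: given a per-vector semantic-importance score (e.g., derived from cross-modal attention), the encoder transmits only the top-ranked vectors as anchors at full fidelity, leaving the rest to be synthesized by the conditional diffusion model at the decoder. All names and the scoring scheme below are illustrative assumptions, not the thesis implementation.

```python
# Hypothetical anchor selection: keep the top-k feature vectors by
# semantic-importance score within a given bit-budget ratio.
def select_anchors(scores, budget_ratio):
    """Return indices of the most semantically significant vectors.

    scores:       importance score per feature vector
    budget_ratio: fraction of vectors the bit budget allows as anchors
    """
    k = max(1, int(len(scores) * budget_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # keep spatial order for bitstream signaling

# Example: 8 feature vectors, 25% anchor budget -> 2 anchors kept.
anchors = select_anchors([0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.1], 0.25)
```

Varying `budget_ratio` is what makes the trade-off between representation cost and fidelity tunable at encode time.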
In the second part, we introduce SCMC, a scalable cross-modality compression framework that hierarchically represents images across semantic, structural, and signal levels. This approach enables task-specific compression and scalable decoding, supporting diverse applications ranging from high-level semantic communication to low-level image reconstruction. Experimental results validate its ability to convey accurate semantic and perceptual information even at extremely low bitrates.
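The scalable-decoding property can be sketched as a layered bitstream in which each layer refines the previous one, so a receiver may stop after the semantic layer (lowest rate, analysis tasks) or continue through structure and signal layers for full reconstruction. The layer names follow the SCMC description above; the decoder logic itself is an illustrative assumption.

```python
# Hypothetical scalable decoder over a three-layer SCMC-style bitstream.
LAYERS = ["semantic", "structure", "signal"]

def decode_scalable(bitstream, stop_at):
    """Decode layers in order, stopping at `stop_at` (inclusive)."""
    assert stop_at in LAYERS
    output = {}
    for layer in LAYERS:
        output[layer] = bitstream[layer]  # stand-in for real entropy decoding
        if layer == stop_at:
            break
    return output

stream = {"semantic": "a dog on grass", "structure": "edge map", "signal": "residual"}
partial = decode_scalable(stream, "structure")  # skips the signal layer
```

Stopping early is what lets the same bitstream serve high-level semantic communication and low-level image reconstruction from a single encode.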
In the third part, we present CMVC, a novel paradigm that integrates MLLMs into video compression. On the encoder side, video content is disentangled into spatial and motion components, which are transformed into distinct modalities through MLLMs for compact representation. On the decoder side, video generative models enable innovative decoding strategies, including Text-Text-to-Video (TT2V) for high-quality semantic reconstruction and Image-Text-to-Video (IT2V) for perceptual consistency. Additionally, an efficient frame interpolation model based on Low-Rank Adaptation (LoRA) ensures smooth motion representation in IT2V decoding.
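The two decoding modes can be summarized as a simple dispatch: TT2V regenerates video purely from transmitted text for semantic reconstruction at minimal rate, while IT2V additionally conditions on decoded keyframes, with an interpolation model filling the motion between them, for perceptual consistency. Function names and the dispatch structure below are illustrative assumptions, not the CMVC implementation.

```python
# Hypothetical sketch of CMVC's decoder-side mode selection.
def decode_cmvc(mode, text, keyframes=None):
    if mode == "TT2V":
        # Text-to-video generation: cheapest bitstream, semantic fidelity.
        return {"source": "text", "conditioning": [text]}
    if mode == "IT2V":
        # Image+text-to-video: keyframes anchor appearance; a LoRA-tuned
        # interpolation model would fill the motion between keyframes.
        assert keyframes, "IT2V needs at least one decoded keyframe"
        return {"source": "image+text", "conditioning": [text, *keyframes]}
    raise ValueError(f"unknown mode: {mode}")

recon = decode_cmvc("IT2V", "a skier descends a slope", keyframes=["kf0", "kf1"])
```

The mode choice thus maps directly onto the rate-quality operating point: text-only payloads for TT2V versus text plus keyframes for IT2V.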
In summary, this thesis contributes to semantic-driven visual coding by proposing:
1) A fine granularity coding approach that enriches image compression with semantic information;
2) A scalable cross-modality compression framework that optimizes representations across semantic, structural, and signal layers; and
3) A cross-modality video coding paradigm that integrates MLLMs and generative models for efficient and flexible video reconstruction.
Extensive experimental results confirm the effectiveness of these proposed approaches, demonstrating significant improvements in semantic fidelity, compression efficiency, and downstream analytical performance.
| Date of Award | 24 Apr 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Shiqi WANG (Supervisor) & Tak Wu Sam KWONG (External Co-Supervisor) |