Visual Quality Assessment: From Deep Unimodal Approaches to Large Multimodal Models

Student thesis: Doctoral Thesis

Abstract

The primary goal of objective visual quality assessment is to automatically predict perceptual visual quality, providing a cost-effective alternative to cumbersome subjective user studies. Quality assessment models play a pivotal role both in the study of human visual perception and in the design of computational vision systems. While deep learning has achieved remarkable success in visual quality assessment, traditional deep learning-based methods build quality assessment models exclusively from unimodal data (i.e., video or image), leaving the quality assessment capabilities of large multimodal models (LMMs) relatively underexplored. This thesis explores improving visual quality assessment through innovative methodologies that go beyond traditional unimodal approaches to include LMMs, and consists of the following four parts.

In the first part, we propose a spatial-temporal interactive video quality assessment (STI-VQA) model, built on the philosophy that video distortion can be inferred from the integration of spatial characteristics and temporal motion along the flow of time. In particular, at each timestamp, both the spatial distortion captured by feature statistics and the local motion captured by feature differences are extracted and fed to a transformer network for motion-aware interaction learning. Meanwhile, the information flow of spatial distortion from shallow to deep layers is constructed adaptively during temporal aggregation.
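
As a minimal sketch of this design (in PyTorch; the module names, dimensions, and pooling choices below are illustrative assumptions, not the thesis implementation), per-frame feature statistics serve as the spatial cue, frame differences serve as the motion cue, and a transformer encoder performs the motion-aware temporal interaction:

import torch
import torch.nn as nn

class STISketch(nn.Module):
    # Hypothetical sketch: spatial statistics + frame-difference motion
    # tokens, aggregated over time by a transformer encoder.
    def __init__(self, feat_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        d_model = 3 * feat_dim  # [mean, std, motion] per timestamp
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, feats):
        # feats: (B, T, C, H, W) frame-level features from any backbone
        mu = feats.mean(dim=(-2, -1))            # (B, T, C) spatial statistics
        sigma = feats.std(dim=(-2, -1))          # (B, T, C)
        diff = feats[:, 1:] - feats[:, :-1]      # feature difference ~ local motion
        motion = diff.abs().mean(dim=(-2, -1))   # (B, T-1, C)
        motion = torch.cat([torch.zeros_like(motion[:, :1]), motion], dim=1)
        tokens = torch.cat([mu, sigma, motion], dim=-1)  # (B, T, 3C)
        h = self.temporal(tokens)                # motion-aware interaction
        return self.head(h.mean(dim=1))          # clip-level quality score

The full model additionally routes spatial-distortion information from shallow to deep backbone layers during aggregation; the sketch keeps only the per-timestamp interaction for brevity.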

In the second part, we take initial steps toward probing the image quality assessment (IQA) capability of existing LMMs by employing two-alternative forced choice (2AFC) prompting, as 2AFC is widely regarded as the most reliable way of collecting human opinions on visual quality. The global quality score of each image estimated by a particular LMM can then be efficiently aggregated using maximum a posteriori (MAP) estimation. Meanwhile, we introduce three evaluation criteria: consistency, accuracy, and correlation, to provide comprehensive quantification of and deeper insight into the IQA capability of five LMMs.
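
To make the aggregation step concrete, here is a hedged sketch of estimating global scores from 2AFC outcomes, assuming a Bradley-Terry-style likelihood with a Gaussian prior (the function name, prior, and learning-rate schedule are illustrative stand-ins for the thesis's MAP procedure):

import numpy as np

def map_scores(wins, prior_var=4.0, lr=0.5, n_iter=5000):
    # wins[i, j]: number of times image i was chosen over image j in 2AFC.
    n = wins.shape[0]
    s = np.zeros(n)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(s[None, :] - s[:, None]))  # p[i, j] = P(i beats j)
        g = wins * (1.0 - p)                 # per-pair log-likelihood gradients
        grad = g.sum(axis=1) - g.sum(axis=0) - s / prior_var
        s += lr * grad / max(wins.sum(), 1.0)
    return s - s.mean()                      # scores identifiable only up to a shift

wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
print(map_scores(wins))  # image 0, preferred most often, receives the highest score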

In the third part, we push the frontier of emerging LMMs to advance visual quality comparison into open-ended settings, in which the model 1) responds to open-range questions on quality comparison and 2) provides detailed reasoning beyond direct answers. To this end, we propose Co-Instruct. To train this first-of-its-kind open-source, open-ended visual quality comparer, we collect the Co-Instruct-562K dataset from two sources: a) LLM-merged single-image quality descriptions and b) GPT-4V "teacher" responses on unlabeled data. Furthermore, to better evaluate this setting, we propose MICBench, the first benchmark on multi-image comparison for LMMs.
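
A minimal sketch of how one comparison training sample might be assembled from the first source (the prompt wording and record schema below are hypothetical illustrations, not the released Co-Instruct-562K format):

# Merge two single-image quality descriptions into one open-ended
# comparison instruction; the LLM answering MERGE_PROMPT provides the
# assistant response used for instruction tuning.
MERGE_PROMPT = (
    "Here are quality descriptions of two images.\n"
    "Image 1: {desc1}\nImage 2: {desc2}\n"
    "Answer the following question by comparing the two images, "
    "with detailed reasoning: {question}"
)

def make_comparison_sample(desc1, desc2, question, llm_answer):
    return {
        "images": ["image1.jpg", "image2.jpg"],   # placeholder paths
        "conversation": [
            {"role": "user", "content": f"<image><image> {question}"},
            {"role": "assistant", "content": llm_answer},
        ],
        "source": "llm_merged_descriptions",      # vs. "gpt4v_teacher"
        "merge_prompt": MERGE_PROMPT.format(
            desc1=desc1, desc2=desc2, question=question),
    }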

In the fourth part, we introduce an all-around LMM-based no-reference IQA (NR-IQA) model, which is capable of producing qualitatively comparative responses and effectively translating these discrete comparative levels into a continuous quality score. During training, we generate scaled-up comparative instructions by comparing images from the same IQA dataset, allowing for more flexible integration of diverse IQA datasets, and we develop a human-like visual quality comparator on the resulting large-scale training corpus. During inference, moving beyond binary choices, we propose a soft comparison that calculates the likelihood of the test image being preferred over multiple predefined anchor images. The quality score is then obtained by maximum a posteriori estimation over the resulting probability matrix.
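
The intuition behind the soft comparison can be shown with a small worked sketch (assuming a logistic, Bradley-Terry-style link between score differences and preference probabilities; the anchor scores and probabilities below are hypothetical placeholders, and the simple average stands in for the full MAP estimation over the probability matrix):

import numpy as np

# P(test preferred over anchor k), e.g., read from the softmax over the
# LMM's comparative answer tokens rather than a hard binary choice.
anchor_scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # known anchor quality levels
p_prefer = np.array([0.9, 0.7, 0.5, 0.3, 0.1])

# Under p_k = sigmoid(s - a_k), each comparison gives s_k = a_k + logit(p_k);
# averaging the per-anchor estimates recovers a continuous score.
logits = np.log(p_prefer / (1.0 - p_prefer))
score = float(np.mean(anchor_scores + logits))
print(f"estimated quality score: {score:.2f}")  # ~3.0 for this example

Each soft comparison pins the test image relative to one anchor, so combining comparisons against several anchors turns a handful of discrete comparative levels into a well-constrained continuous score.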

Overall, this thesis enhances the performance of visual quality assessment models through four key advancements: 1) the development of a spatial-temporal interactive VQA model, 2) the exploration of the IQA capabilities of LMMs, 3) the extension of LMMs to open-ended visual quality comparison, and 4) the introduction of an all-around LMM-based NR-IQA model. Extensive experimental results verify the effectiveness and superiority of these innovative approaches.
Date of Award: 19 Dec 2024
Original language: English
Awarding Institution
  • City University of Hong Kong
Supervisor: Shiqi WANG (Supervisor)

Keywords

  • Image quality assessment
  • Video quality assessment
  • Vision transformer
  • Large multimodal model
