Abstract
The rapid development of generative Artificial Intelligence (AI) has led to an explosive growth of AI-synthesized visual content, spanning human face videos, photorealistic synthetic images, and other multimodal creations driven by large foundation models. While these technologies have expanded the boundaries of visual media, they also pose critical challenges for content understanding, quality assessment, and authenticity verification. Visual quality assessment aims to automatically predict the perceptual quality of media content from a human-centric perspective, while authenticity verification focuses on determining the genuineness of visual content. In response, this thesis presents a comprehensive investigation into AI-synthesized content assessment, covering the spectrum from perceptual evaluation to forensic analysis, from unimodal CNN-based deep models to large multimodal systems, and from opaque black-box predictions to trustworthy and explainable reasoning. The thesis consists of the following four parts, corresponding to a progressive exploration of quality and authenticity assessment across diverse application scenarios of AI-generated visual content.
In the first part, we investigate the visual quality of face videos compressed by generative video codecs. Specifically, we construct a large-scale fine-grained video quality assessment (VQA) dataset containing 3,240 compressed face video sequences annotated with subjective quality scores. We benchmark popular VQA methods and reveal their weaknesses on generatively compressed face videos, which hinder real-world applications. Based on the uniqueness of face video content, we further propose the FAce VideO IntegeRity (FAVOR) index to measure the perceptual quality of compressed face videos, which achieves superior performance on the proposed dataset.
In the second part, we explore the quality of AI-generated images (AIGIs) in real-world application scenarios. Conventional image quality assessment (IQA) models primarily focus on low-level visual perception, while existing IQA works on AIGIs overemphasize the generated content itself, neglecting its effectiveness in real-world applications. To bridge this gap, we propose AIGI-VC, a quality assessment database focusing on the communicability of AIGIs in the advertising field. The evaluation covers information clarity and emotional interaction, providing deeper insights into the real-world usability of AIGIs.
In the third part, we probe the capability of large multimodal models (LMMs) in explainable AIGI detection. Current AIGI detection models and databases largely focus on binary classification and offer no explanations that the general public can understand, which weakens the credibility of authenticity judgments and may conceal potential model biases. We therefore pioneer the probing of LMMs for explainable AIGI detection by presenting FakeBench, a multimodal database with textual authenticity descriptions. FakeBench examines LMMs on four evaluation criteria: detection, reasoning, interpretation, and fine-grained forgery analysis, to obtain deeper insights into their image authenticity-relevant capabilities. This research marks a paradigm shift towards transparency in fake image detection and reveals the need for greater emphasis on forensic elements in visual-language research and AI risk control.
In the fourth part, we introduce a large multimodal expert model, FakeScope, for transparent AIGI detection, which not only identifies synthetic images with high accuracy but also delivers rich, interpretable, and query-contingent forensic insights. At the foundation of our approach is FakeChain, a large-scale dataset containing structured forensic reasoning based on visual trace evidence, constructed via a novel human-machine collaborative framework. Building upon this foundation, we develop FakeInstruct, the largest multimodal instruction tuning dataset to date, comprising two million visual instructions that instill nuanced forensic awareness into LMMs. Empowered by FakeInstruct, FakeScope achieves state-of-the-art performance in both closed-ended and open-ended forensic scenarios. Notably, despite being trained exclusively on qualitative hard labels, FakeScope demonstrates remarkable zero-shot quantitative capability on detection via our proposed token-based probability estimation strategy.
Overall, this thesis presents a comprehensive investigation into the assessment of AI-synthesized visual content, spanning perceptual quality and authenticity. Through systematic dataset construction, novel quality measures, and the development of explainable multimodal forensic models, this work advances reliable evaluation across a range of real-world generated media scenarios. Extensive experimental results verify the effectiveness and superiority of the proposed approaches.
| Date of Award | 8 Aug 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Shiqi WANG (Supervisor) |
Keywords
- AI-synthesized content
- visual quality assessment
- AI-generated image detection
- forensic investigation
- large multimodal models