Semantics-Oriented Multitask DeepFake Detection with Model-and-Human in the Loop

Project: Research


Description

The advent of deep generative models has greatly streamlined and automated the production of realistic counterfeit face images and videos, popularly known as DeepFakes. These models pose a significant threat to the trustworthiness and authenticity of digital visual information. In response, researchers have devised a plethora of DeepFake detection tools aimed at automatically identifying falsified face images and videos. Despite their proven effectiveness, these detectors still grapple with some fundamental issues that restrict their applicability to unanticipated real-world face manipulations. First, current DeepFake detectors generally follow a manipulation-oriented approach, which encourages learning manipulation-specific features with limited generalizability. Second, when processing DeepFake videos, they predominantly take an image-based approach, aggregating frame-level predictions without appropriate spatiotemporal analysis. Furthermore, applying these methods to real-world videos, often several minutes long, is highly computationally inefficient. Last, existing DeepFake detectors are designed within a static setting, where the training and test sets are fixed, resulting in a high likelihood of failure when faced with challenging real-world examples.

In this project, we first establish a set of DeepFake detection tasks based on face attributes at different semantic levels. This leads to semantics-oriented multitask learning of DeepFake detectors based on vision Transformers. The process of parameter sharing/splitting among tasks can be automated through joint embedding, and the primary task (i.e., determining whether an image is real or fake) can be prioritized using bi-level optimization. Next, we expand our DeepFake image detector to accommodate DeepFake videos by linearizing the attention computation in vision Transformers using learnable spatiotemporal queries.
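As a rough illustration of query-based attention linearization, the NumPy sketch below cross-attends a small set of learnable queries to a long sequence of spatiotemporal tokens, reducing the cost from quadratic to linear in the number of tokens. All names, shapes, and the Perceiver-style formulation are illustrative assumptions, not the project's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_linearized_attention(tokens, queries):
    """Cross-attend M learnable queries to N spatiotemporal tokens.

    Cost is O(N * M) instead of the O(N^2) of token-to-token
    self-attention, so it scales linearly with video length.
    tokens:  (N, d) patch embeddings pooled over frames of a clip
    queries: (M, d) learnable spatiotemporal queries, with M << N
    Returns an (M, d) summary of the clip.
    """
    d = tokens.shape[1]
    scores = queries @ tokens.T / np.sqrt(d)   # (M, N) attention logits
    return softmax(scores, axis=-1) @ tokens   # (M, d) attended summary

rng = np.random.default_rng(0)
N, M, d = 1024, 8, 64                  # many tokens, few queries
tokens = rng.standard_normal((N, d))
queries = rng.standard_normal((M, d))
summary = query_linearized_attention(tokens, queries)
print(summary.shape)  # (8, 64)
```

Because the output size is fixed at M queries regardless of clip length, downstream classification layers need not change as videos grow longer.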
Meanwhile, we propose an efficient clip sampler that identifies discriminative video clips from long videos, enhancing practical DeepFake video detection. Last, we intend to expose challenging real-world examples of our DeepFake detectors using the maximum discrepancy principle. These examples will be annotated by human participants and used for fine-tuning, leading to dynamic training of DeepFake detectors with model-and-human in the loop.

The preliminary results of our detectors are very promising. They surpass standard binary and C-way classification formulations of DeepFake detection, and they offer competitive detection performance compared to existing methods on both known and unknown face manipulations. Furthermore, dynamic training of DeepFake detectors with model-and-human in the loop indeed enhances DeepFake detection performance. The deliverables of this project will inspire new research avenues in the creation and detection of DeepFakes, making substantial contributions to the media forensics community.
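The maximum discrepancy idea can be sketched as follows: given two detectors' scores on unlabeled samples, select the samples on which the detectors disagree most and route them to human annotators. This is a minimal sketch under that assumption; the function name and the toy scores are hypothetical.

```python
import numpy as np

def max_discrepancy_samples(scores_a, scores_b, k):
    """Select the k samples on which two detectors disagree most.

    scores_a, scores_b: per-sample 'fake' probabilities from two
    detectors. The most-discrepant samples are the ones sent to
    human annotators and later used for fine-tuning.
    """
    gap = np.abs(np.asarray(scores_a) - np.asarray(scores_b))
    return np.argsort(gap)[::-1][:k]  # indices, largest gap first

# Toy example: detectors agree on samples 0 and 3, disagree on 1 and 2.
a = np.array([0.9, 0.2, 0.8, 0.1])
b = np.array([0.85, 0.9, 0.1, 0.15])
picked = max_discrepancy_samples(a, b, k=2)
print(picked)
```

Samples where both detectors agree, whether right or wrong, carry little information for distinguishing them, so disagreement is a cheap proxy for where at least one detector must be failing.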

Detail(s)

Project number: 9043711
Grant type: GRF
Status: Not started
Effective start/end date: 1/01/25 → …