Abstract
The past decade has witnessed the tremendous success of automatic face recognition (AFR), owing to its high recognition accuracy and ease of use. Face information has become the dominant biometric trait of a person: a unique, non-verbal, yet powerful FaceID. However, AFR systems are vulnerable to advanced face attacks, both digital and physical. Given the unrestricted access to face images/videos and the powerful face manipulation tools circulating freely on the internet, face attacks can be launched even by non-expert hackers, raising pressing security concerns. This thesis focuses on safeguarding personal facial information against digital and physical attacks. It consists of five parts:

1) An effective model is designed to efficiently detect manipulated digital face images while simultaneously locating the manipulated regions;
2) A ViT-based face forgery detector using Low-Rank Adaptation (LoRA) modules is designed to achieve more generalizable forgery detection;
3) A novel paradigm is proposed that seeks to reveal the authentic face hidden behind the fake one by leveraging the joint information of face and audio;
4) A cost-effective, acoustic-based face anti-spoofing (FAS) system for mobile devices is developed, which employs a crafted acoustic signal as the probe to counter physical face presentation attacks;
5) A multi-modal mobile FAS system named M3FAS is designed to perform more accurate and robust face presentation attack detection.

In the first part, we propose a conceptually simple yet effective method that efficiently detects forged faces while simultaneously locating the manipulated regions. The proposed scheme relies on a segmentation map that delivers meaningful high-level semantic clues about the image. A noise map is additionally estimated, playing a complementary role by capturing low-level clues that subsequently inform the decision-making. Finally, the features from these two modules are combined to distinguish fake faces. Extensive experiments show that the proposed model achieves outstanding detection accuracy and remarkable localization performance.
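To make the two-stream design concrete, the following is a minimal PyTorch sketch: one branch learns high-level semantic features (the role supervised by the segmentation map), the other learns low-level noise features, and their fusion drives both a localization mask and a real/fake decision. The branch architectures, channel widths, and fusion strategy here are illustrative assumptions, not the thesis' exact model.

```python
# Minimal sketch of a two-stream detect-and-localize model (assumed layout).
import torch
import torch.nn as nn

class TwoStreamDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # Semantic branch: high-level clues (segmentation-map supervision).
        self.semantic = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # Noise branch: complementary low-level clues from an estimated noise map.
        self.noise = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.mask_head = nn.Conv2d(128, 1, 1)  # per-pixel manipulation mask
        self.cls_head = nn.Linear(128, 2)      # real vs. fake from pooled features

    def forward(self, x):
        f = torch.cat([self.semantic(x), self.noise(x)], dim=1)  # fuse both streams
        mask = torch.sigmoid(self.mask_head(f))                  # where is it fake?
        logits = self.cls_head(f.mean(dim=(2, 3)))               # is it fake?
        return logits, mask

model = TwoStreamDetector()
logits, mask = model(torch.randn(1, 3, 256, 256))
print(logits.shape, mask.shape)  # torch.Size([1, 2]) torch.Size([1, 1, 256, 256])
```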
In the second part, we design a more general fake face detection model based on the Vision Transformer (ViT) architecture. Despite their demonstrated success in intra-domain face forgery detection, existing detection methods tend to suffer dramatic performance drops when deployed to unforeseen domains. To mitigate this issue, we propose using Low-Rank Adaptation (LoRA) modules to improve the model's generalization capability. During training, the pretrained ViT weights are frozen and only the LoRA modules are updated. Additionally, the Single Center Loss (SCL) is applied to supervise the training process, further improving the model's generalization capability. The proposed method achieves state-of-the-art detection performance in both cross-manipulation and cross-dataset evaluations.
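A minimal sketch of the core mechanics follows, assuming a standard LoRA formulation and a simplified single-center-style loss; the rank, scaling, loss form, and placement inside the ViT are assumptions, not the thesis' exact configuration.

```python
# Minimal PyTorch sketch: a LoRA adapter over a frozen linear layer,
# plus a simplified single-center-style loss (forms are assumptions).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # pretrained ViT weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: identity at start
        self.scale = alpha / rank

    def forward(self, x):
        # y = Wx + scale * B(Ax); only A and B receive gradients.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

def single_center_loss(feat, labels, center, margin=1.0):
    """Pull real features (label 0) toward one center, push fakes past a margin."""
    d = torch.norm(feat - center, dim=1)
    d_real = d[labels == 0].mean()
    d_fake = d[labels == 1].mean()
    return d_real + torch.relu(d_real - d_fake + margin)

layer = LoRALinear(nn.Linear(768, 768))   # e.g. a ViT-Base attention projection
y = layer(torch.randn(2, 197, 768))       # (batch, tokens, dim)
loss = single_center_loss(torch.randn(8, 768),
                          torch.tensor([0, 0, 0, 0, 1, 1, 1, 1]),
                          center=torch.zeros(768))
print(y.shape, loss.item() >= 0)
```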
In the third part, we propose a new paradigm that seeks to reveal the authentic face hidden behind the fake one by leveraging the joint information of face and audio. More specifically, given the fake face and an audio segment, cross-modal transfer is exploited by learning to generate the feature of the authentic face from the underlying clues in the audio together with the fake face's appearance. The effectiveness of the proposed scheme is validated through a series of evaluations, and experimental results show that the proposed model achieves promising performance in revealing the hidden faces in terms of reconstruction quality, identity retrieval, and face attribute inference accuracy.
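The cross-modal idea can be sketched as follows: encode the audio and the fake-face appearance, then regress the feature of the hidden authentic face from their fusion. The feature dimensions and encoder designs below are illustrative assumptions.

```python
# Minimal sketch of audio + fake-face fusion for authentic-face feature
# generation (encoders and dimensions are assumptions).
import torch
import torch.nn as nn

class HiddenFaceGenerator(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(128, dim), nn.ReLU())  # e.g. mel features
        self.face_enc = nn.Sequential(nn.Linear(512, dim), nn.ReLU())   # fake-face embedding
        self.generator = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim),
        )

    def forward(self, audio_feat, fake_face_feat):
        z = torch.cat([self.audio_enc(audio_feat),
                       self.face_enc(fake_face_feat)], dim=-1)
        return self.generator(z)  # predicted authentic-face feature

model = HiddenFaceGenerator()
pred = model(torch.randn(4, 128), torch.randn(4, 512))
# Train against the true face embedding, e.g. with an L2 objective:
loss = nn.functional.mse_loss(pred, torch.randn(4, 512))
print(pred.shape, loss.item() >= 0)
```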
In the fourth part, we devise a novel and cost-effective FAS system based on the acoustic modality, named Echo-FAS, which employs a crafted acoustic signal as the probe to perform face liveness detection. We first build a large-scale, high-diversity, acoustic-based FAS database, Echo-Spoof. Then, based upon Echo-Spoof, we design a novel two-branch framework that combines global and local frequency clues to distinguish live from spoofed faces. Echo-FAS offers the following merits: 1) it requires only a speaker and a microphone, both readily available on commodity devices, and no expensive extra hardware; 2) it successfully captures the 3D geometric information of input queries and achieves remarkable face anti-spoofing performance. Echo-FAS provides new insights for the development of FAS systems on mobile devices.
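A minimal sketch of the two-branch frequency design, assuming one branch operates on the global spectrum of the recorded echo and the other on local time-frequency patches; the signal length, FFT size, and network widths are assumptions.

```python
# Minimal sketch of a global/local frequency two-branch classifier
# for acoustic liveness detection (all parameters are assumptions).
import torch
import torch.nn as nn

class EchoFASNet(nn.Module):
    def __init__(self, n_samples=4096, n_fft=256):
        super().__init__()
        self.n_fft = n_fft
        # Global branch: magnitude spectrum of the full echo recording.
        self.global_branch = nn.Sequential(nn.Linear(n_samples // 2 + 1, 64), nn.ReLU())
        # Local branch: 2D convolution over the STFT magnitude.
        self.local_branch = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.cls = nn.Linear(64 + 16, 2)  # live vs. spoof

    def forward(self, echo):  # echo: (batch, n_samples)
        g = self.global_branch(torch.fft.rfft(echo).abs())   # global frequency clue
        window = torch.hann_window(self.n_fft, device=echo.device)
        stft = torch.stft(echo, self.n_fft, window=window, return_complex=True)
        l = self.local_branch(stft.abs().unsqueeze(1)).flatten(1)  # local clue
        return self.cls(torch.cat([g, l], dim=1))

model = EchoFASNet()
print(model(torch.randn(2, 4096)).shape)  # torch.Size([2, 2])
```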
In the fifth part, we devise an accurate and robust multi-modal mobile face anti-spoofing system named M3FAS. Thanks to the pervasive availability of cameras, speakers, and microphones on mobile devices, the designed framework combines RGB and acoustic data via hierarchical feature aggregation modules to perform robust FAS. In addition, M3FAS simultaneously outputs three predictions from the vision, acoustic, and fusion heads, and we find that this multi-head training strategy effectively boosts performance and makes the model more flexible. M3FAS achieves 99.9% AUC and 98.3% ACC in face liveness detection, and extensive experiments demonstrate its robustness under various challenging experimental settings.
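A minimal sketch of the multi-head idea: separate vision and acoustic encoders, three heads (vision, acoustic, fusion), and joint supervision of all three. The toy encoders and equal-weight loss below are simplified assumptions standing in for the hierarchical feature aggregation modules.

```python
# Minimal sketch of multi-modal FAS with three jointly trained heads
# (encoders, dimensions, and loss weighting are assumptions).
import torch
import torch.nn as nn

class M3FASSketch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.vision_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(256, dim), nn.ReLU())
        self.head_v = nn.Linear(dim, 2)       # vision-only prediction
        self.head_a = nn.Linear(dim, 2)       # acoustic-only prediction
        self.head_f = nn.Linear(2 * dim, 2)   # fusion head sees both modalities

    def forward(self, img, audio):
        v, a = self.vision_enc(img), self.audio_enc(audio)
        return self.head_v(v), self.head_a(a), self.head_f(torch.cat([v, a], dim=1))

model = M3FASSketch()
img, audio = torch.randn(8, 3, 32, 32), torch.randn(8, 256)
labels = torch.randint(0, 2, (8,))
# Multi-head training: supervise all three heads, here with equal weights.
loss = sum(nn.functional.cross_entropy(p, labels) for p in model(img, audio))
print(loss.item() >= 0)
```

Keeping separate heads also makes the model flexible at inference time: if one sensor is unavailable, the corresponding single-modality head can still produce a prediction.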
Overall, this thesis contributes to the integrity of face forensics from two aspects: 1) To counter digital face attacks, we first propose a two-stream framework for accurate face manipulation detection and localization; we then design a ViT-based forgery detector with Low-Rank Adaptation (LoRA) modules to improve generalization; finally, we take a further step and reconstruct the authentic facial information behind fake videos by fusing the features of authentic audio and fake faces. 2) To defend AFR systems against physical face presentation attacks, we first design an acoustic-based system for mobile devices that performs accurate and cost-effective face anti-spoofing; we then combine the acoustic and RGB modalities to further improve the robustness and accuracy of face liveness detection.
| Date of Award | 1 Aug 2023 |
| --- | --- |
| Original language | English |
| Awarding Institution | |
| Supervisor | Shiqi WANG (Supervisor) |
Keywords
- Digital face attacks
- Physical face attacks
- Face forgery detection
- Presentation attack detection
- Multi-modal learning
- Optimization