Deep Learning Based Facial Behavior Analysis


Student thesis: Doctoral Thesis



Award date: 12 Jul 2023


Facial behavior analysis can facilitate human-computer interaction and many applications in the fields of health, education, security, customer service, and virtual reality. Conventional machine learning-based methods necessitate intricate designs for extracting handcrafted representations with dedicated purposes, whereas deep learning-based methods can acquire flexible and generalized representations. In this thesis, we investigated the use of deep learning techniques for various tasks in facial behavior analysis and addressed several challenges arising in this field.

First, a cognition-inspired convolutional neural network (CNN) was proposed for recognizing facial expressions (FEs). According to the facial action coding system, FEs can be expressed as certain combinations of action units (AUs), so a good FE recognition method should learn these AU features inherently. However, small or low-intensity AUs may be easily overlooked, because local contextual information is lacking and these weak AU features may vanish in the computation of deep feature maps. To overcome these limitations, dilated Inception blocks were used to extract representations with multiple kernel scales, which aids in learning rich local contextual information. In addition, we introduced feature-guided auxiliary learning to utilize high-level semantic information to guide the learning of shallow layers, thereby enabling more effective use of multi-scale information. Moreover, the network utilized knowledge transferred from face recognition tasks to further enhance its performance in recognizing FEs. These procedures resulted in a method that can recognize FEs in a more systematic and effective manner by enhancing information at the kernel, network, and knowledge scales.
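The core idea behind dilation, spacing out the kernel taps to enlarge the receptive field without adding parameters, can be illustrated with a 1-D toy sketch. This is only a minimal, hypothetical illustration, not the thesis's actual 2-D dilated Inception block:

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1-D convolution with dilated kernel taps.

    The effective receptive field grows to (len(kernel) - 1) * dilation + 1
    while the number of kernel parameters stays the same.
    """
    span = (len(kernel) - 1) * dilation + 1
    return [sum(kernel[j] * signal[start + j * dilation]
                for j in range(len(kernel)))
            for start in range(len(signal) - span + 1)]

signal = [1, 2, 3, 4, 5, 6]
kernel = [1, 1, 1]           # 3 taps in both cases
dense = dilated_conv1d(signal, kernel, dilation=1)   # sees 3 samples per output
dilated = dilated_conv1d(signal, kernel, dilation=2) # sees a 5-sample span
```

An Inception-style block would run several such branches with different dilation rates in parallel and concatenate their outputs, giving each layer access to multiple context scales at once.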

Second, a self-supervised motion learning method was proposed for recognizing facial micro-expressions (FMEs). FMEs are brief and unconscious facial movements that are difficult to detect with the naked eye and last for only a fraction of a second, in contrast with FEs, which are easily noticeable and identifiable movements that last for a longer duration. Although deep learning-based methods have achieved significant success in recognizing FMEs, they still require a complex pre-processing step that uses conventional optical flow techniques to extract facial motions as inputs. To overcome this limitation, a novel framework was proposed that uses self-supervised learning to directly extract facial motion for FMEs. However, this approach may overlook symmetrical facial actions on the left and right sides of a face when extracting fine features. To address this problem, a symmetric contrastive vision transformer was developed to constrain the learning of similar facial action features for the left and right sides of a face.
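The symmetry constraint can be sketched as a simple consistency term: features extracted from the left half-face and from the mirrored right half-face are pulled toward agreement. This is a simplified cosine-consistency toy with hypothetical names, not the full contrastive objective of the symmetric vision transformer described above:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def symmetry_loss(feat_left, feat_right_mirrored):
    """0 when the half-face features align, up to 2 when they oppose."""
    return 1.0 - cosine(feat_left, feat_right_mirrored)

# Identical half-face features incur no penalty; orthogonal ones do.
aligned = symmetry_loss([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])
mismatched = symmetry_loss([1.0, 0.0], [0.0, 1.0])
```

A contrastive formulation would additionally push features from *different* facial actions apart; the term above only captures the attraction between the two sides of the same face.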

Third, deep motion retargeting-based methods were introduced for generating FMEs. One challenge in FME analysis is data scarcity, since data are collected from only a small number of subjects. Fortunately, generative models can help synthesize new images with desired FMEs. However, FMEs involve subtle facial movements, and the limited training samples hinder feature learning. To address these issues, we developed an FME generation (FMEG) method using deep motion retargeting and transfer learning. Then, to improve the extraction and generation of these subtle facial movements, we developed an edge-aware motion-based FMEG method with an auxiliary edge prediction task and an edge-intensified multi-head self-attention module to aid in extracting and generating subtle features. The resulting methods exhibited a remarkable generalization ability, enabling the generation of FMEs across considerable domain gaps.
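The retargeting idea can be reduced to its keypoint form: take the displacement of each landmark in a driving micro-expression sequence, relative to its first frame, and apply it to the corresponding landmark of a new source face. This is a hypothetical minimal sketch of that transfer step, not the thesis's learned dense-motion model:

```python
def retarget_motion(source_kps, driving_kps_t, driving_kps_0):
    """Transfer the driving sequence's landmark displacements to a source face.

    source_kps     -- (x, y) landmarks of the source identity
    driving_kps_t  -- landmarks of the driving face at frame t
    driving_kps_0  -- landmarks of the driving face at its first (neutral) frame
    """
    return [(sx + (dx - d0x), sy + (dy - d0y))
            for (sx, sy), (dx, dy), (d0x, d0y)
            in zip(source_kps, driving_kps_t, driving_kps_0)]

# A 1-pixel upward lip-corner motion in the driving clip is copied onto
# the source face's own lip-corner position.
moved = retarget_motion([(10, 10)], [(5, 6)], [(5, 5)])
```

Because only *relative* motion is transferred, the source identity's geometry is preserved, which is what lets such methods generalize across subjects with large appearance gaps.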

Lastly, due to the coronavirus disease 2019 (COVID-19) pandemic and other infectious diseases, wearing face masks has become necessary for personal health protection, raising a challenge in recognizing FEs. Therefore, an effective face mask detection (FMD) method was presented to detect faces and identify whether people are wearing face masks. The FMD method, which is based on a deep CNN, was designed to be lightweight, making it suitable for deployment on resource-constrained devices, such as mobile phones and embedded systems. To compensate for the weaker representational capacity of lightweight models relative to large models, a residual context attention module was introduced to extract rich context and focus on crucial face mask-related regions. In addition, a synthesized Gaussian heatmap regression module was employed to learn more discriminative representations for faces with and without masks.
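A Gaussian heatmap regression target is typically synthesized by placing a 2-D Gaussian peak at a face's location, so the network regresses a smooth spatial map instead of a hard label. The following is a minimal sketch of how such a target map might be synthesized (the exact target design in the thesis may differ):

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma):
    """Target heatmap: 1.0 at the face center (cx, cy), decaying with a
    Gaussian falloff controlled by sigma."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

# A 5x5 target with the face centered at (2, 2): the peak sits at the
# center and the response decays symmetrically toward the borders.
target = gaussian_heatmap(5, 5, cx=2, cy=2, sigma=1.0)
```

Compared with one-hot box labels, such soft targets give nonzero gradients around the face location as well, which tends to stabilize training for small, lightweight detectors.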