Machine Learning Techniques in Human Facial Expression and Action Recognition


Student thesis: Doctoral Thesis



Award date: 13 May 2022


Automatic recognition of human facial expressions and body actions is widely employed in human-machine interaction, surveillance systems and augmented reality. Because expressions and actions are vital components of nonverbal communication, machine learning and deep learning techniques for recognising them have become an important area of research in computer vision.

Expression recognition typically relies on intensity, edge and geometric features, which overlook the actual shape curvatures of facial regions. The first stage of this work presents a novel two-stage approach to distinguish seven expressions based on 11 different facial areas. Contour and region harmonics were combined to capture the interrelationship of sublocal areas in the human face for expression recognition. We applied a multiclass support vector machine with subject-dependent k-fold cross-validation to classify human emotions into expression classes. We tested the proposed method on three public facial expression datasets for sublocal regions in the human face and achieved recognition rates of 94.90%, 93.43% and 92.57% on the extended Cohn-Kanade, compound facial expressions of emotions and multimedia understanding group datasets, respectively. Experiments show that the contour and region harmonics have high classification power and can be computed efficiently. Our method provides higher accuracy and requires less computing time and memory than existing techniques, including deep learning.
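The idea of describing a facial region by the harmonics of its outline can be sketched with a generic complex Fourier descriptor. This is only an illustration under stated assumptions: the abstract does not specify the exact harmonic formulation used in the thesis, and the function names `contour_harmonics` and `sample_ellipse` are hypothetical.

```python
import cmath
import math


def contour_harmonics(points, max_order=3):
    """Return scale-normalised Fourier magnitudes of a closed contour.

    points: (x, y) samples ordered counter-clockwise along the contour.
    The result maps each harmonic order k to |c_k| / |c_1|.
    Assumption: the first harmonic c_1 is non-zero.
    """
    z = [complex(x, y) for x, y in points]
    n = len(z)

    def coefficient(k):
        # Discrete Fourier coefficient of order k of the complex contour.
        return sum(
            z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        ) / n

    scale = abs(coefficient(1))  # size of the dominant first harmonic
    # Order 0 (the centroid) is skipped, so the result ignores translation.
    orders = [k for k in range(-max_order, max_order + 1) if k not in (0, 1)]
    return {k: abs(coefficient(k)) / scale for k in orders}


def sample_ellipse(a, b, n=64):
    """Hypothetical test contour: an axis-aligned ellipse with semi-axes a, b."""
    return [
        (a * math.cos(2 * math.pi * t / n), b * math.sin(2 * math.pi * t / n))
        for t in range(n)
    ]


circle = contour_harmonics(sample_ellipse(1.0, 1.0))
ellipse = contour_harmonics(sample_ellipse(2.0, 1.0))
```

For the circle every listed harmonic is numerically zero, whereas the ellipse has a single non-zero entry at order -1 equal to (a - b) / (a + b). Descriptors of this kind, computed per facial area, could then serve as the feature vector passed to a classifier such as the multiclass support vector machine mentioned above.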

Facial expression recognition (FER) using a deep convolutional neural network (DCNN) is important and challenging. Although substantial effort has been made to increase FER accuracy through DCNNs, previous studies are still not sufficiently generalisable for real-world applications. Traditional FER studies are mainly limited to controlled lab-posed frontal facial images, which lack the challenges of motion blur, head poses, occlusions, face deformations and lighting under uncontrolled conditions. In the second stage of this work, we proposed a SqueezExpNet architecture that can take advantage of local and global facial information for a highly accurate FER system that can handle environmental variations. Our network was divided into two stages: a geometrical attention stage that possesses a SqueezeNet-like architecture to obtain local highlight information and a spatial texture stage comprising several squeezed and expanded layers to exploit high-level global features. In particular, we created a weighted mask of 3D face landmarks and used element-wise multiplication with a spatial feature in the first stage to draw attention to important local facial regions. Next, we input the face spatial image and its augmentations into the second stage of the network. Finally, for classification, a recurrent neural network block was designed to combine the highlighted information from both stages rather than simply using the softmax function, thereby helping to overcome the uncertainties. Experiments covering basic and compound FER tasks were performed using three leading facial expression datasets. Our strategy outperformed the existing DCNN methods and achieved state-of-the-art results. The developed architecture, adopted research methodology and reported findings may find applications in real-time FER for surveillance, health and feedback systems.
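The weighted-mask step described above can be illustrated with a minimal sketch. It assumes the landmarks have already been projected to 2D grid positions and that the weights fall off as a Gaussian around each landmark; neither detail is stated in the abstract, and `landmark_mask` and `apply_mask` are hypothetical names.

```python
import math


def landmark_mask(height, width, landmarks, sigma=1.5):
    """Build a weighted mask that peaks at 1.0 on each (row, col) landmark.

    Assumption: a Gaussian fall-off with standard deviation sigma.
    """
    mask = []
    for r in range(height):
        row = []
        for c in range(width):
            # Keep the strongest response over all landmarks at this cell.
            row.append(max(
                math.exp(-((r - lr) ** 2 + (c - lc) ** 2) / (2 * sigma ** 2))
                for lr, lc in landmarks
            ))
        mask.append(row)
    return mask


def apply_mask(feature_map, mask):
    """Element-wise multiplication of a 2D feature map with the mask."""
    return [
        [f * m for f, m in zip(feature_row, mask_row)]
        for feature_row, mask_row in zip(feature_map, mask)
    ]


# Hypothetical 8x8 feature map of ones with two landmark positions.
features = [[1.0] * 8 for _ in range(8)]
mask = landmark_mask(8, 8, [(2, 2), (5, 6)])
attended = apply_mask(features, mask)
```

Cells on a landmark keep their full activation, while cells far from every landmark are suppressed towards zero, which is how the multiplication draws attention to local facial regions.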

Human skeleton-based action recognition in 3D sequences is an active research area in computer vision. The dynamics of human skeletons have received little attention in terms of semantics and joint connectivity. Graph convolutional networks (GCNs) have recently shown promising performance for this task because of their strengths in modelling the dynamics and dependencies of sequential data. The spatial and temporal dynamics of human body structures on a graph carry vital information about a particular action. However, existing GCN methods lack adaptive and self-attentive ability with respect to the adjacency matrix. In the third stage of this work, we proposed a new class of GCN, namely, adaptive local and global context-aware and spatiotemporal self-attentive GCN, for skeleton-based action recognition. The adaptive and self-attentive GCN can focus on the important joints and bones in each frame by using local and global adaptive graph topology. We also introduced a spatial and temporal self-attention graph mechanism to improve this capability further. The attention performance of our network was enhanced progressively with this mechanism. Extensive experiments on three large-scale datasets demonstrate that the proposed model outperforms state-of-the-art methods by a considerable margin.
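A graph convolution whose topology is not fixed can be sketched as follows. This is a simplified illustration, not the thesis architecture: it assumes the effective topology is the sum of a normalised skeleton adjacency, a learned offset matrix and a dot-product self-attention map over the joints, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def normalise_adjacency(adj):
    """Add self-loops and normalise each row to sum to 1."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]


def adaptive_graph_conv(x, adj, learned_offset, weight):
    """x: joints x channels features for one frame.

    Assumed topology = fixed skeleton + learned offset + self-attention.
    """
    n, d = len(x), len(x[0])
    x_t = [list(col) for col in zip(*x)]
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(x, x_t)]
    attention = softmax_rows(scores)  # data-dependent joint-to-joint weights
    fixed = normalise_adjacency(adj)
    topology = [[fixed[i][j] + learned_offset[i][j] + attention[i][j]
                 for j in range(n)] for i in range(n)]
    return matmul(matmul(topology, x), weight)


# Hypothetical 3-joint chain with 2 channels per joint.
skeleton = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
x = [[1.0, 1.0] for _ in range(3)]
offset = [[0.0] * 3 for _ in range(3)]  # would be a trained parameter
identity = [[1.0, 0.0], [0.0, 1.0]]     # would be a trained weight matrix
out = adaptive_graph_conv(x, skeleton, offset, identity)
```

With a zero offset, each row of the combined topology sums to 2 (one from the normalised skeleton, one from the attention map), so the constant input above yields 2.0 for every output entry. In a trained network the offset and weight matrices would be learned, letting the layer strengthen or add joint connections that the physical skeleton does not contain.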

Skeleton-based recognition of human actions has received attention in recent years because of the popularity of 3D acquisition sensors. Existing studies use 3D skeleton data from video clips collected from several views. The body view shifts from the camera perspective when humans perform certain actions, resulting in unstable and noisy skeletal data. Moreover, the possibility of self-occlusions complicates recognition. In the fourth stage of this work, we developed a view-adaptive (VA) mechanism that identifies the viewpoints across the action video sequence and transforms the skeleton view through a data-driven learning process to counteract the influence of view variations. Most existing methods use a fixed human-defined prior criterion to reposition the skeletons. By contrast, we utilised an unsupervised repositioning approach and jointly designed a VA neural network based on the graph neural network (GNN). Our VA-GNN model can transform the skeletons of distinct views into a considerably more consistent virtual perspective than comparable preprocessing approaches can. The VA module learns the best-observed view: it determines the most suitable view and transforms the skeletons from the action sequence for end-to-end recognition, together with a suitable graph topology from the adaptive GNN. Thus, our strategy reduces the influence of view variance, allowing networks to focus on learning action-specific properties and resulting in improved performance. In experiments on four benchmark datasets, our model achieves higher accuracy with fewer parameters than state-of-the-art approaches, demonstrating the effectiveness of the proposed approach.
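The geometric core of a view transformation can be sketched as a translation followed by a rotation of every joint. The sketch below makes strong simplifying assumptions: it rotates only about the vertical axis, and the angle and shift are passed in by hand, whereas in a view-adaptive network they would be predicted from the sequence. The function name `view_transform` is hypothetical.

```python
import math


def view_transform(joints, angle, shift):
    """Move 3D joints into a virtual view.

    Each joint is first translated by -shift, then rotated by `angle`
    radians about the vertical (y) axis. Assumption: a single rotation
    axis is enough for illustration.
    """
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for x, y, z in joints:
        x, y, z = x - shift[0], y - shift[1], z - shift[2]
        out.append((cos_a * x + sin_a * z, y, -sin_a * x + cos_a * z))
    return out


# Hypothetical two-joint frame and hand-picked view parameters.
frame = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
rotated = view_transform(frame, math.pi / 2, (0.0, 0.0, 0.0))
recentred = view_transform(frame, 0.0, (1.0, 0.0, 0.0))
```

A quarter turn maps the joint at (1, 0, 0) to approximately (0, 0, -1), and a pure shift moves it to the origin. Learning these parameters end to end, rather than fixing them by a hand-designed rule, is what distinguishes the data-driven repositioning described above from fixed preprocessing.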

    Research areas

  • Contour description, region description, weighted mask, attention mechanism, adaptive graph, self-attention