Video Human Action Recognition Based on Deep Learning


Student thesis: Doctoral Thesis



Related Research Unit(s)


Awarding Institution
Award date: 22 Apr 2022


Video human action recognition is one of the key tasks in video understanding. It is defined as automatically recognizing which actions humans perform in a video. In recent years, with the exponential growth of data on the Internet, effective and efficient video analysis has become critical. Action recognition has a wide range of applications in fields such as intelligent surveillance, health care, human-computer interaction, and robot learning. It remains a difficult problem because of challenges such as variations in illumination and viewpoint, moving cameras, and complex backgrounds. Thanks to large-scale manually annotated video datasets and advances in deep learning, remarkable progress has been made in recent years in learning discriminative spatio-temporal representations from labelled and unlabelled videos for video human action recognition.

Data augmentation is critical for deep learning-based human activity recognition. Video action recognition methods based on deep convolutional neural networks usually use random cropping or its variants for data augmentation. However, this traditional approach may generate many non-informative samples (video patches covering only a small part of the foreground, or only the background) that are unrelated to the specific action. These samples can be regarded as noisy samples with incorrect labels, and they reduce overall action recognition performance. Moreover, people tend to pay more attention to motion information when recognizing activities. We attempt to enhance the motion information and mitigate the influence of noisy samples through a Siamese architecture, termed the Motion-patch-based Siamese Convolutional Neural Network (MSCNN). A "motion patch" is defined as a square region that contains critical motion information in the video, and we propose a simple but effective method for selecting such regions. To evaluate the proposed MSCNN, we conducted a number of experiments on two popular datasets, UCF-101 and HMDB-51. The mathematical model and experimental results verify that the proposed architecture is capable of enhancing the motion information and that it achieves comparable performance.
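The idea of picking a square region with strong motion can be illustrated with a small sketch. It is not the selection method of the thesis, which the abstract does not specify. It assumes a per-pixel motion-magnitude map is already available (for example, derived from optical flow) and simply returns the square window with the largest summed motion, using a summed-area table. The function name `select_motion_patch` and its inputs are hypothetical.

```python
def select_motion_patch(motion, size):
    """Return (top, left) of the size x size window with the largest summed motion.

    `motion` is a 2D list of non-negative motion magnitudes (an assumed input,
    e.g. optical-flow magnitude per pixel). Ties keep the first window found.
    """
    height, width = len(motion), len(motion[0])
    if size > height or size > width:
        raise ValueError("patch size exceeds frame size")

    # Summed-area table with a zero border: sat[y][x] = sum of motion[:y][:x].
    sat = [[0.0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0.0
        for x in range(width):
            row_sum += motion[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum

    best_total, best_pos = float("-inf"), (0, 0)
    for top in range(height - size + 1):
        for left in range(width - size + 1):
            bottom, right = top + size, left + size
            # Window sum in O(1) from four table lookups.
            total = (sat[bottom][right] - sat[top][right]
                     - sat[bottom][left] + sat[top][left])
            if total > best_total:
                best_total, best_pos = total, (top, left)
    return best_pos


motion_map = [
    [0, 0, 0, 0, 0],
    [0, 0, 5, 4, 0],
    [0, 0, 3, 6, 0],
    [0, 0, 0, 0, 0],
]
patch_origin = select_motion_patch(motion_map, 2)
```

In this toy map the motion is concentrated in rows 1 to 2 and columns 2 to 3, so the selected window starts at that corner. A randomly placed window of the same size would often cover background only, which is the kind of non-informative sample described above.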

We conduct a deeper study to tackle the problem of noisy labels generated by traditional data augmentation methods. In the previous work, optical flow had to be extracted to locate the motion area, which is computationally expensive and inflexible. In addition, other saliency-based methods, such as saliency maps for salient patch detection, suffer from differences in dataset distribution. To address these issues, we use reinforcement learning to sample effective training video patches from the video dataset itself, without extra data, and propose an Auto-augmented Siamese Neural Network (ASNet). In this framework, we propose backpropagating salient patches and randomly cropped samples in the same iteration, performing gradient compensation to alleviate the adverse gradient effects of non-informative samples. Salient patches are samples that contain critical information for human action recognition. The generation of salient patches is formulated as a Markov decision process, and a reinforcement learning agent called the Salient Patch Agent (SPA) is introduced to extract patches in a weakly supervised manner without extra labels. Extensive experiments were conducted on two well-known datasets, UCF-101 and HMDB-51, to verify the effectiveness of the proposed SPA and ASNet.
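Backpropagating a salient patch and a random crop in the same iteration can be sketched with a toy model. The example below is an illustration under strong assumptions, not the ASNet implementation: the network is replaced by a linear softmax classifier, the two branches share one weight matrix, and the two gradients are simply summed. All names (`ce_grad`, `siamese_step`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad(weights, features, label):
    """Gradient of cross-entropy w.r.t. linear classifier weights (classes x dims)."""
    logits = [sum(w * f for w, f in zip(row, features)) for row in weights]
    probs = softmax(logits)
    return [[(p - (1.0 if c == label else 0.0)) * f for f in features]
            for c, p in enumerate(probs)]


def siamese_step(weights, random_crop, salient_patch, label, lr=0.1):
    """One update in which both branches share `weights` and their gradients add up."""
    g_rand = ce_grad(weights, random_crop, label)
    g_sal = ce_grad(weights, salient_patch, label)
    return [[w - lr * (gr + gs) for w, gr, gs in zip(w_row, gr_row, gs_row)]
            for w_row, gr_row, gs_row in zip(weights, g_rand, g_sal)]


# A background-only crop (all-zero features) contributes no useful gradient;
# the salient patch from the same video still drives the update.
start = [[0.0, 0.0], [0.0, 0.0]]
updated = siamese_step(start, random_crop=[0.0, 0.0],
                       salient_patch=[1.0, 0.0], label=0)
```

Here the update comes entirely from the salient branch, which is one simple way to picture how an informative sample can compensate for a non-informative one in the same iteration.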

Although supervised video human activity recognition has been a great success, annotating huge amounts of data is costly, which limits the deployment of deep models. Learning effective video representations for human action recognition (HAR) with few or even no annotations is an important but challenging task. Recently, self-supervised learning has proven effective for learning from unlabelled data when no large-scale annotated dataset is available for training. It has the potential to exploit large-scale data, solve data shortage problems, and reduce the cost of labelling. Recent approaches mainly use contrastive learning or pretext tasks for spatio-temporal feature learning. However, these approaches only discriminate similar training samples from dissimilar ones and ignore the degree of similarity between samples. In this work, taking the degree of similarity of the sampled instances into account, we propose a novel pretext task: spatio-temporal overlap rate (STOR) prediction. It stems from the observation that humans can discriminate between the overlap rates of videos in space and time. The task encourages the model to discriminate the STOR of two generated samples in order to learn representations. In addition, we jointly optimize contrastive learning and pretext tasks to further enhance spatio-temporal feature learning, and we study the mutual influence of the components to guide the design. Extensive experiments demonstrate that the proposed STOR prediction task benefits both contrastive learning and pretext tasks, and that the joint optimization scheme significantly improves spatio-temporal feature learning for video understanding.
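To make the notion of an overlap rate in space and time concrete, the sketch below computes one possible target value for a pair of clips sampled from the same video. The abstract does not give the exact definition of STOR, so this example assumes an intersection-over-union of the two spatio-temporal volumes; the clip encoding `(t0, t1, y0, y1, x0, x1)` and the function names are hypothetical.

```python
def overlap_1d(a_start, a_end, b_start, b_end):
    """Length of the overlap of two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def volume(clip):
    """Volume of a clip given as (t0, t1, y0, y1, x0, x1)."""
    t0, t1, y0, y1, x0, x1 = clip
    return (t1 - t0) * (y1 - y0) * (x1 - x0)


def overlap_rate(clip_a, clip_b):
    """Assumed overlap rate: intersection over union of two spatio-temporal boxes."""
    intersection = 1
    for axis in (0, 2, 4):  # time, height, width
        intersection *= overlap_1d(clip_a[axis], clip_a[axis + 1],
                                   clip_b[axis], clip_b[axis + 1])
    union = volume(clip_a) + volume(clip_b) - intersection
    return intersection / union if union > 0 else 0.0


# Two 16-frame, 112x112 crops: shifted by 8 frames in time and 56 pixels in width.
clip_a = (0, 16, 0, 112, 0, 112)
clip_b = (8, 24, 0, 112, 56, 168)
rate = overlap_rate(clip_a, clip_b)
```

Under this assumed definition, identical clips give a rate of 1, disjoint clips give 0, and partially overlapping clips fall in between, which provides the graded similarity signal that a purely positive-versus-negative contrastive objective does not use.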

    Research areas

  • Deep Learning, Action Recognition, 3D CNN, Self-Supervised Learning