Sign Language Recognition Based on R(2+1)D With Spatial-Temporal-Channel Attention
Research output: Journal Publications and Reviews (RGC: 21, 22, 62) › 21_Publication in refereed journal › peer-review
Author(s)
Han, Xiangzu; Lu, Fei; Yin, Jianqin et al.
Detail(s)
Original language | English |
---|---|
Journal / Publication | IEEE Transactions on Human-Machine Systems |
Online published | 2 Feb 2022 |
Publication status | Online published - 2 Feb 2022 |
Link(s)
DOI | DOI |
---|---|
Link to Scopus | https://www.scopus.com/record/display.uri?eid=2-s2.0-85124184340&origin=recordpage |
Permanent Link | https://scholars.cityu.edu.hk/en/publications/publication(051d2b27-6f70-4198-ac23-338cef50bf67).html |
Abstract
Previous work utilized three-dimensional (3-D) convolutional neural networks (CNNs) to model the spatial appearance and temporal evolution concurrently for sign language recognition (SLR) and exhibited impressive performance. However, challenges remain for 3-D CNN-based methods. First, motion information plays a more significant role than spatial content in sign language, so it is questionable whether space and time should be treated equally and modeled jointly by heavy 3-D convolutions in a unified approach. Second, because of interference from the highly redundant information in sign videos, it is nontrivial to effectively extract discriminative spatiotemporal features related to sign language. In this study, deep R(2+1)D was adopted for separate spatial and temporal modeling, demonstrating that decomposing 3-D convolution filters into independent spatial and temporal convolutions facilitates the optimization process in SLR. A lightweight spatial-temporal-channel attention module, comprising two submodules called channel-temporal attention and spatial-temporal attention, was proposed to make the network concentrate on significant information along the spatial, temporal, and channel dimensions by combining squeeze-and-excitation attention with self-attention. By embedding this module into R(2+1)D, results superior or comparable to state-of-the-art methods were obtained on the CSL-500, Jester, and EgoGesture datasets, demonstrating the effectiveness of the proposed method.
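The central factorization the abstract describes, replacing each full 3-D convolution with a 2-D spatial convolution followed by a 1-D temporal convolution, can be illustrated with a minimal PyTorch sketch of the general R(2+1)D idea. This is not the authors' implementation; the class name, kernel sizes, and intermediate width below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class R2Plus1DBlock(nn.Module):
    """(2+1)D factorization sketch: a 1x3x3 spatial convolution followed by
    a 3x1x1 temporal convolution, each with its own BN and ReLU, standing in
    for a single 3x3x3 3-D convolution."""
    def __init__(self, in_channels, out_channels, mid_channels=None):
        super().__init__()
        # R(2+1)D chooses the intermediate width so the parameter count roughly
        # matches the full 3-D convolution; here we simply default to out_channels.
        mid = mid_channels or out_channels
        self.spatial = nn.Conv3d(in_channels, mid, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        self.bn1 = nn.BatchNorm3d(mid)
        self.temporal = nn.Conv3d(mid, out_channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (N, C, T, H, W) video feature maps
        x = self.relu(self.bn1(self.spatial(x)))   # 2-D spatial modeling
        x = self.relu(self.bn2(self.temporal(x)))  # 1-D temporal modeling
        return x

clip = torch.randn(2, 64, 16, 56, 56)  # batch, channels, frames, height, width
print(R2Plus1DBlock(64, 128)(clip).shape)  # torch.Size([2, 128, 16, 56, 56])
```

The extra nonlinearity between the spatial and temporal steps is one reason the factorized form is easier to optimize than a monolithic 3-D kernel, which is the optimization benefit the abstract refers to.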
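The attention module is described only at a high level (squeeze-and-excitation combined with self-attention over the spatial, temporal, and channel axes), so the following is just one plausible reading of the channel-temporal submodule: spatial dimensions are averaged away and a per-frame, per-channel gate is produced by a small excitation MLP. All names and the reduction ratio are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class ChannelTemporalAttention(nn.Module):
    """Hypothetical squeeze-and-excitation style gate over channels,
    computed independently for each frame of the clip."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (N, C, T, H, W)
        n, c, t, _, _ = x.shape
        squeezed = x.mean(dim=(3, 4)).permute(0, 2, 1)  # spatial squeeze -> (N, T, C)
        gate = self.excite(squeezed).permute(0, 2, 1)   # per-frame gate in [0, 1], (N, C, T)
        return x * gate.reshape(n, c, t, 1, 1)          # rescale the feature maps

feats = torch.randn(2, 128, 16, 28, 28)
print(ChannelTemporalAttention(128)(feats).shape)  # torch.Size([2, 128, 16, 28, 28])
```

In the paper, a module of this kind is embedded into the R(2+1)D backbone (e.g., `nn.Sequential(R2Plus1DBlock(64, 128), ChannelTemporalAttention(128))` under the sketch above) so that redundant frames and channels are down-weighted before further spatiotemporal modeling.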
Research Area(s)
- Convolution, Feature extraction, Videos, Hidden Markov models, Gesture recognition, Task analysis, Spatiotemporal phenomena, Attention mechanism, R(2+1)D, sign language recognition (SLR), SCALE GESTURE RECOGNITION, FRAMEWORK, FUSION
Citation Format(s)
Sign Language Recognition Based on R(2+1)D With Spatial-Temporal-Channel Attention. / Han, Xiangzu; Lu, Fei; Yin, Jianqin et al.
In: IEEE Transactions on Human-Machine Systems, 02.02.2022.