Abstract
The automatic correction of grammar and spelling errors is important for students, second-language learners, and several Natural Language Processing (NLP) tasks such as part-of-speech tagging and text summarization. Recently, Neural Machine Translation (NMT) has become a well-established, high-performing approach to the task of Grammar Error Correction (GEC). Arabic GEC (AGEC) is still maturing because of challenges such as the scarcity of training sets and the complexity of the Arabic language. To overcome these issues, we introduce an unsupervised method that generates large-scale synthetic training data based on a confusion function, increasing the amount of available training data. Furthermore, we introduce a supervised NMT model for AGEC called SCUT AGEC. SCUT AGEC is a convolutional sequence-to-sequence model consisting of nine encoder-decoder layers with an attention mechanism. We apply fine-tuning to improve performance and obtain more efficient results. Convolutional Neural Networks (CNNs) give our model the ability to perform feature extraction and classification jointly in one task, and we show that this is an efficient way to capture features of the local context. Moreover, long-term dependencies are easy to capture thanks to the stacking of convolutional layers. Our proposed model is the first supervised AGEC system based on convolutional sequence-to-sequence learning to outperform the current state-of-the-art neural AGEC models.
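The confusion-function idea mentioned above can be illustrated with a minimal sketch: corrupt clean text with character substitutions drawn from confusion sets, yielding synthetic (erroneous source, clean target) pairs for training. The confusion sets and error rate below are hypothetical examples for illustration, not the paper's actual confusion function.

```python
import random

# Hypothetical character-level confusion sets modeling common Arabic
# spelling errors (illustrative only, not the paper's actual function).
CONFUSION = {
    "ا": ["أ", "إ", "آ"],  # bare alif vs. hamza-carrying variants
    "ة": ["ه"],            # taa marbuta vs. haa
    "ي": ["ى"],            # yaa vs. alif maqsura
}

def corrupt(sentence, error_rate=0.3, rng=None):
    """Inject character-level errors into a clean sentence, producing the
    erroneous source side of a synthetic training pair."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    out = []
    for ch in sentence:
        # Substitute a confusable character with probability error_rate.
        if ch in CONFUSION and rng.random() < error_rate:
            out.append(rng.choice(CONFUSION[ch]))
        else:
            out.append(ch)
    return "".join(out)

clean = "ذهب الولد الى المدرسة"
noisy = corrupt(clean, error_rate=1.0)
print(noisy, "->", clean)  # synthetic pair: corrupted source, clean target
```

Running this over a large monolingual corpus (e.g. clean news text) would produce parallel data at scale; the real system would pair such synthetic data with authentic annotated corrections.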
Original language | English |
---|---|
Pages (from-to) | 303-315 |
Number of pages | 13 |
Journal | Egyptian Informatics Journal |
Volume | 22 |
Issue number | 3 |
Online published | 24 Dec 2020 |
DOIs | |
Publication status | Published - Sept 2021 |
Externally published | Yes |
Funding
This work was supported by the Science and Technology Program of Guangzhou, China (No. 201802010025), the University Innovation and Entrepreneurship Education Fund Project of Guangzhou (No. 2019PT103), and the Guangdong Province Major Field Research and Development Program Project (No. 2019B010154004).

The QALB corpus is a collaborative project between Columbia University and Carnegie Mellon University Qatar, funded by the Qatar National Research Fund. Its data comes from online comments written on Aljazeera Arabic news channel articles. The QALB shared task at ANLP-ACL 2015 also includes non-native data from the Arabic Learners Written Corpus (CERCLL) and machine-translation data obtained from Wikipedia articles translated into Arabic using Google Translate. The training dataset contains 2 million words annotated and corrected by native Arabic speakers. The Alwatan Arabic news articles corpus, collected from the Omani newspaper, contains 20,291 articles and 10,000,000 words categorized into six groups. In the present study, the generated training data consists of 18,061,610 words across the training and development sets. The fine-tuning set consists of the training and development sets, and we used the QALB test set for testing. The whole training corpus consists of 20,285,278 words, divided into synthetic and authentic sets; details are shown in Table 3.
Research Keywords
- Arabic grammar error correction
- Convolutional neural networks
- Natural language processing