Synthetic data with neural machine translation for automatic correction in Arabic grammar

Aiman Solyman, Wang Zhenyu*, Tao Qian, Arafat Abdulgader Mohammed Elhag, Muhammad Toseef, Zeinab Aleibeid

*Corresponding author for this work

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

35 Citations (Scopus)

Abstract

The automatic correction of grammar and spelling errors is important for students, second-language learners, and several Natural Language Processing (NLP) tasks such as part-of-speech tagging and text summarization. Recently, Neural Machine Translation (NMT) has become a well-established, high-performing approach to Grammar Error Correction (GEC). Arabic GEC is still maturing because of challenges such as the scarcity of training sets and the complexity of the Arabic language. To overcome these issues, we introduce an unsupervised method that generates large-scale synthetic training data based on a confusion function, increasing the amount of available training data. Furthermore, we introduce a supervised NMT model for Arabic GEC (AGEC) called SCUT AGEC. SCUT AGEC is a convolutional sequence-to-sequence model consisting of nine encoder-decoder layers with an attention mechanism. We applied fine-tuning to improve performance and obtain more efficient results. Convolutional Neural Networks (CNNs) give our model the ability to perform feature extraction and classification jointly in one task, and we show that this is an efficient way to capture features of the local context. Moreover, long-term dependencies are easy to obtain because of the stacking of convolutional layers. Our proposed model is the first supervised AGEC system based on convolutional sequence-to-sequence learning, and it outperforms the current state-of-the-art neural AGEC models.
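To illustrate the confusion-function idea, here is a minimal sketch of how clean text can be corrupted into (noisy, clean) training pairs. The confusion sets, error rate, and function names below are illustrative assumptions, not the paper's actual configuration; the sets chosen mirror common Arabic orthographic confusions (hamza-bearing alif forms, taa marbuta vs. haa, alif maqsura vs. yaa).

```python
import random

# Hypothetical confusion sets: characters Arabic writers commonly confuse.
CONFUSION_SETS = [
    ["\u0623", "\u0625", "\u0627"],  # أ / إ / ا  (hamza-on-alif variants)
    ["\u0629", "\u0647"],            # ة / ه      (taa marbuta vs. haa)
    ["\u0649", "\u064A"],            # ى / ي      (alif maqsura vs. yaa)
]

def confuse(sentence: str, error_rate: float = 0.1, seed: int | None = None) -> str:
    """Replace characters with confusable alternatives at a fixed rate."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        replaced = False
        if rng.random() < error_rate:
            for conf_set in CONFUSION_SETS:
                if ch in conf_set:
                    # Swap in a different member of the same confusion set.
                    out.append(rng.choice([c for c in conf_set if c != ch]))
                    replaced = True
                    break
        if not replaced:
            out.append(ch)
    return "".join(out)

def make_pairs(clean_sentences):
    # Each clean sentence yields a (source=noisy, target=clean) pair.
    return [(confuse(s, seed=i), s) for i, s in enumerate(clean_sentences)]
```

Trained on such pairs, a sequence-to-sequence model learns to map the artificially corrupted source back to the clean target, which is the standard way synthetic noising is used to augment scarce GEC training data.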
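The abstract's claim that stacked convolutions capture long-range context can be made concrete with a sketch of one encoder layer in the convolutional sequence-to-sequence style (Gehring et al., 2017). Dimensions, kernel size, and the scaling constant below are illustrative assumptions, not the paper's reported hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoderLayer(nn.Module):
    """One convolutional encoder layer with a gated linear unit (GLU)."""

    def __init__(self, dim: int = 256, kernel_size: int = 3):
        super().__init__()
        # 2*dim output channels feed the GLU, which halves them back to dim.
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); Conv1d expects (batch, dim, seq_len).
        residual = x
        h = self.conv(x.transpose(1, 2))
        h = F.glu(h, dim=1).transpose(1, 2)
        # Residual connections let stacked layers grow the receptive
        # field: N layers with kernel k cover about N*(k-1)+1 tokens.
        return (h + residual) * (0.5 ** 0.5)

# Stacking nine such layers, as in the paper's encoder, widens the
# context window each position can attend to.
encoder = nn.Sequential(*[ConvEncoderLayer() for _ in range(9)])
x = torch.randn(2, 10, 256)  # (batch, seq_len, embedding dim)
y = encoder(x)               # same shape: (2, 10, 256)
```

With kernel size 3, nine stacked layers give each output position a receptive field of roughly 19 input tokens, which is how depth substitutes for recurrence in capturing long-term dependencies.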

Original language: English
Pages (from-to): 303-315
Number of pages: 13
Journal: Egyptian Informatics Journal
Volume: 22
Issue number: 3
Online published: 24 Dec 2020
DOIs
Publication status: Published - Sept 2021
Externally published: Yes

Funding

This work was supported by the Science and Technology Program of Guangzhou, China (No. 201802010025), the University Innovation and Entrepreneurship Education Fund Project of Guangzhou (No. 2019PT103), and the Guangdong Province Major Field Research and Development Program Project (No. 2019B010154004). The QALB corpus is a collaborative project between Columbia University and Carnegie Mellon University Qatar, funded by the Qatar National Research Fund; its data comes from online comments written on Aljazeera Arabic news articles. The QALB shared task at ANLP-ACL 2015 also includes non-native data from the Arabic Learners Written Corpus (CERCLL) and some machine-translation data obtained from Wikipedia articles translated into Arabic with Google Translate. The training dataset contains 2 million words annotated and corrected by native Arabic speakers. The Alwatan corpus of Arabic news articles, collected from the Omani newspaper Alwatan, contains 20,291 articles and 10,000,000 words categorized into six groups. In the present study, the generated training data consists of 18,061,610 words across the training and development sets. The fine-tuning set consists of the training and development sets, and we used the QALB test set for testing. The whole training data consists of 20,285,278 words, divided into synthetic and authentic sets; details are shown in Table 3.
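As a sanity check on the stated totals (assuming the synthetic and authentic sets are disjoint and together make up the whole training data), the implied size of the authentic set is 20,285,278 − 18,061,610 = 2,223,668 words, consistent with the roughly 2-million-word annotated QALB training set described above.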

Research Keywords

  • Arabic grammar error correction
  • Convolutional neural networks
  • Natural language processing
