An L1-L2 Parallel Dependency Treebank for Learner Chinese


Student thesis: Doctoral Thesis




Awarding Institution
Award date: 20 Oct 2022


Treebanks are collections of syntactically analyzed sentences. A parallel treebank consists of sentence pairs that have been syntactically annotated and aligned at the sub-sentential level. While parallel treebanks have become available for many language combinations, there has been little effort in creating such a treebank for learner text, in which parse trees of sentences written by non-native speakers are aligned to parse trees of their target hypotheses. Although the use of treebanks in foreign language teaching and learning is not yet widespread, their potential in enhancing language pedagogy and computer-assisted language learning has been a research topic of growing interest.

This thesis presents a parallel treebank of learner Chinese, which consists of 4,748 sentences, including 1,552 learner sentences and their corrected versions. We present the dependency annotation scheme of the parallel treebank, based on the Universal Dependencies (UD) framework, as well as the procedure for its construction and the inter-annotator agreement. We analyze the properties of learner Chinese revealed by this treebank, in terms of the syntactic structures overused and underused by learners. According to the log-likelihood statistic, the adverbializer and the sentence-final particle are underused, while parataxis and directional verb compounds are overused. The thesis then investigates, as an application of the treebank, the retrieval of sentences containing specific grammatical errors. The error categories related to questions, DE position, beneficiaries, and subsidiary relations achieved the highest precision, while those concerning noun modifiers and time expressions obtained the best recall. Further, we examine differences between minimal target hypotheses, which emphasize minimal editing, and fluent target hypotheses, which aim at native fluency. Fluent target hypotheses were found to reveal a wider range of distinctive learner usage in our treebank.
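The overuse and underuse comparison mentioned above can be illustrated with a minimal sketch of the log-likelihood (G2) statistic as it is commonly applied in corpus comparison. The exact formulation used in the thesis is not given in this abstract, so the two-corpus version below is an assumption, and the function names and counts are hypothetical and purely illustrative.

```python
import math


def log_likelihood(count_a, total_a, count_b, total_b):
    """Two-corpus log-likelihood (G2) for a single feature.

    count_a / count_b: occurrences of the feature (e.g. one dependency
    relation) in corpus A and corpus B.
    total_a / total_b: total number of relations in each corpus.
    """
    combined = count_a + count_b
    total = total_a + total_b
    # Expected counts if the feature were equally frequent in both corpora.
    expected_a = total_a * combined / total
    expected_b = total_b * combined / total
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def usage_direction(count_a, total_a, count_b, total_b, critical=3.84):
    """Label a feature in corpus A relative to corpus B.

    3.84 is the chi-square critical value for p < 0.05 with one degree
    of freedom; the threshold used in the thesis is not stated here.
    """
    if log_likelihood(count_a, total_a, count_b, total_b) < critical:
        return "no significant difference"
    rate_a = count_a / total_a
    rate_b = count_b / total_b
    return "overused" if rate_a > rate_b else "underused"


# Hypothetical counts: a relation seen 30 times in 1,000 learner
# relations and 10 times in 1,000 target-hypothesis relations.
print(usage_direction(30, 1000, 10, 1000))
```

In this sketch corpus A stands for the learner sentences and corpus B for their target hypotheses, so "overused" means the relation is significantly more frequent in learner text.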

This thesis has advanced the study of learner language in two directions from the perspective of computer-assisted language learning. First, it has created a dependency treebank for learner Chinese by adapting the Universal Dependencies annotation scheme for Chinese. Second, it has further developed the treebank into an L1-L2 parallel treebank.

    Research areas

  • parallel treebank, corpus linguistics, second language acquisition, Universal Dependencies, learner Chinese