Corpus-based Analysis of Mediæval Chinese Literature: Case Study on the Chinese Buddhist Canon


Student thesis: Doctoral Thesis




Awarding Institution
Award date: 21 May 2018


Treebanks, collections of syntactically analyzed sentences, have been applied more and more frequently for data-mining in recent decades. This study built a treebank of the entire Tripiţaka Koreana, the edition of the Chinese Buddhist canon stored in Korea, to provide a tool for systematic analysis of its content. The treebank provides three levels of annotation, which also serve as an aid for reading the original text: word boundaries, parts-of-speech and dependency relations. In this dissertation, the author demonstrates the potential of applying this tool to analyze both the content and the language of the canon.

The Chinese Buddhist canon is a remarkable religious text, with over 250 million followers worldwide. The sheer volume of this text, approximately 50 million Chinese characters in total, makes it hugely difficult for any research team, let alone an individual scholar, to analyze the contents or linguistic features of even a small fraction of the whole canon. Fortunately, corpora and treebanks now provide tools for data-mining raw text, and thus a basis for distant reading. At present, large-scale treebanks of religious texts have been compiled for only two of the major religions of the world. The current work helps to fill this gap with a dependency treebank of the Chinese Buddhist canon built from limited manually annotated data. The treebank was then applied to analyze different aspects of the original text, namely, (i) the protagonists and locations, (ii) conversational networks, and (iii) diachronic syntax of Mediæval Chinese.

The author first built the Tripiţaka Koreana Treebank, of 46 million characters, using only 50 thousand characters of manually annotated data. The small-scale treebank of Chinese Buddhist text developed by Lee and Kong (2016) served as the training data for the model used to parse the treebank produced in this study. For this small-scale treebank, the guidelines of the Penn Chinese Treebank by Xue et al. (2005) and the Stanford Dependencies for Chinese by Chang et al. (2009) were adopted, with five syntactic relations added for grammatical structures that exist in Classical Chinese but not in Modern Standard Chinese. The author trained a conditional random fields model for automatic tagging of word boundaries and parts-of-speech. A minimum-spanning tree parser was then trained to parse the treebank automatically. Syntactic trees of the whole Tripiţaka Koreana were thus derived, labelled with word boundaries, parts-of-speech and dependency relations. The whole Tripiţaka Koreana Treebank was released on the internet for open public access in CoNLL format.
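
As an illustration of how the three annotation levels can be read back from a CoNLL file, the following is a minimal sketch only. It assumes the ten-column CoNLL-X layout (the abstract does not say which CoNLL variant was used), and the sample sentence, its part-of-speech tags and its relation labels are hypothetical examples rather than data taken from the released treebank.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    idx: int      # position of the word in the sentence (1-based)
    form: str     # the segmented word, i.e. the word-boundary annotation
    pos: str      # part-of-speech tag
    head: int     # index of the syntactic head (0 = root)
    deprel: str   # dependency relation to the head


def read_conll(text: str) -> List[List[Token]]:
    """Parse CoNLL-X style text; sentences are separated by blank lines."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):   # skip comment lines if present
            continue
        cols = line.split("\t")
        # Assumed columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL ...
        current.append(Token(int(cols[0]), cols[1], cols[4], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def arcs(sentence: List[Token]) -> List[tuple]:
    """Return (head word, relation, dependent word) for each token."""
    forms = {t.idx: t.form for t in sentence}
    forms[0] = "ROOT"
    return [(forms[t.head], t.deprel, t.form) for t in sentence]


# Hypothetical sample: "The Buddha told Ananda."
SAMPLE = "\n".join([
    "1\t佛\t_\tNN\tNN\t_\t2\tnsubj\t_\t_",
    "2\t告\t_\tVV\tVV\t_\t0\troot\t_\t_",
    "3\t阿難\t_\tNR\tNR\t_\t2\tdobj\t_\t_",
])

sentences = read_conll(SAMPLE)
for head, rel, dep in arcs(sentences[0]):
    print(f"{rel}({head}, {dep})")
```

Each token row carries all three levels at once: the word form marks the segmentation, the tag column the part-of-speech, and the head and relation columns the dependency tree.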

The author also developed a methodology for the quantitative analysis of the protagonists and locations in a literary text using techniques from natural language processing, and applied it to the treebank. The grammatical relations between words were used to extract all the predicatives, together with their subjects and objects, for further analysis. As a result, the most frequent verbs, and the most frequent characters appearing as nominal subjects, were derived. The most significant differences between the three most popular epithets of Śākyamuni Buddha were also identified using the log-likelihood statistic. In addition to the protagonists, the most frequent locations appearing in the canon, and the places that Śākyamuni Buddha visited, were also derived. Furthermore, the Mahāyāna and Hīnayāna sections of the canon were contrasted to reveal significant differences in their locations and characters. The full database of protagonists and locations in the Tripiţaka Koreana was released on the internet.
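
The log-likelihood comparison mentioned above can be sketched as follows. This is an assumption-laden illustration: it uses the two-term log-likelihood (G2) formulation common in corpus linguistics, which may differ from the exact variant used in the thesis, and the counts in the usage example are invented for demonstration only.

```python
import math


def log_likelihood(a: int, c: int, b: int, d: int) -> float:
    """Two-term log-likelihood (G2) for one word in two sub-corpora.

    a: frequency of the word in sub-corpus 1, c: size of sub-corpus 1
    b: frequency of the word in sub-corpus 2, d: size of sub-corpus 2
    """
    # Expected frequencies if the word were equally likely in both sub-corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    # A zero observed count contributes nothing (the limit of x*ln(x) at 0 is 0).
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2.0 * g2


# Invented counts: a verb occurring 30 times among 1000 predicates of one
# epithet and 10 times among 1000 predicates of another.
score = log_likelihood(30, 1000, 10, 1000)
print(round(score, 2))
```

A larger value indicates a stronger difference in relative frequency between the two sub-corpora; a word with the same relative frequency in both scores zero.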

The above analysis shows that the most frequent verbs in the treebank are quotative verbs. For this reason, the author also devised an algorithm to extract all instances of direct speech in the treebank, and analyzed the quotative verbs, speakers and listeners, and thus the whole conversational network. This produced statistics such as the most frequent speakers, their most frequent listeners, speech lengths, and the most frequent quotative verbs used when the Buddha is the speaker or the listener. Honorific use of quotative verbs was also found in the canon. This study made use of the special properties of these verbs to induce a hierarchy of characters. It was found that Śākyamuni Buddha occupies the top position, while the bodhisattvas rank second. The disciples, deities, and kings were treated as characters of lower status in the canon.
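
One way to picture the conversational network and the status ranking described above is the sketch below. It is not the algorithm of the thesis: the utterance triples are invented, the choice of 白 as an "upward" verb and 告 as a "downward" verb is an illustrative assumption, and the scoring scheme is a hypothetical simplification.

```python
from collections import Counter, defaultdict

# Assumed, illustrative verb classes (not taken from the thesis):
UPWARD = {"白"}     # used when addressing someone of higher status
DOWNWARD = {"告"}   # used when addressing someone of lower status

# Invented (speaker, quotative verb, listener) triples.
UTTERANCES = [
    ("佛", "告", "阿難"),
    ("阿難", "白", "佛"),
    ("佛", "告", "舍利弗"),
    ("舍利弗", "白", "佛"),
    ("阿難", "言", "舍利弗"),
]


def build_network(utterances):
    """Count directed speaker -> listener edges."""
    edges = Counter()
    for speaker, _verb, listener in utterances:
        edges[(speaker, listener)] += 1
    return edges


def status_scores(utterances):
    """Hypothetical scoring: honorific verbs shift status between the pair."""
    scores = defaultdict(int)
    for speaker, verb, listener in utterances:
        if verb in UPWARD:
            scores[listener] += 1
            scores[speaker] -= 1
        elif verb in DOWNWARD:
            scores[speaker] += 1
            scores[listener] -= 1
    return dict(scores)


edges = build_network(UTTERANCES)
speakers = Counter(speaker for speaker, _verb, _listener in UTTERANCES)
scores = status_scores(UTTERANCES)
ranking = sorted(scores, key=lambda name: scores[name], reverse=True)
print(ranking[0], scores[ranking[0]])
```

Neutral verbs such as 言 add an edge to the network but leave the status scores unchanged, so only honorific usage influences the ranking in this sketch.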

In addition to analyzing the content, this study also used the treebank to show one way of applying a dependency treebank to diachronic linguistic research. The author studied the nature and genre of the Chinese Buddhist canon from the perspective of syntactic change in the following constructions: (i) the use of demonstratives, (ii) classifier constructions, (iii) nominal constructions, (iv) the disposal construction, (v) prepositional phrases, and (vi) polar questions. It was found that the texts translated before the tenth century were generally vernacular. However, the language of those translated in the tenth and eleventh centuries lacked many vernacular elements, which had been removed by the Stylists.

    Research areas

  • Buddhism, Chinese Buddhist canon, Classical Chinese literature, corpus linguistics, treebank, quantitative method