Analyzing who, what, and where in a mediaeval Chinese corpus: A case study on the Chinese Buddhist Canon

Research output: Chapters, Conference Papers, Creative and Literary Works > RGC 12 - Chapter in an edited book (Author) > peer-review

Abstract

Information extraction from historical text is challenging because of the lack of data to train natural language processing tools. This chapter evaluates the utility of in-domain training data for data-driven profiling of characters, verbs, and toponyms and reports a case study on a corpus of Chinese Buddhist text. As is typical for such a corpus, the Chinese Buddhist Canon has few annotated linguistic resources other than lexica of names, places, and domain-specific terms. We apply a lexicon-based approach for named entity recognition and then report an analysis of the “who,” “what,” and “where” of the Canon: who the characters were, what they did, and where they were. Experimental results also show that even a small amount of word segmentation, part-of-speech, and dependency annotation can improve accuracy in named entity recognition and in extraction of character-verb associations.
Original language: English
Title of host publication: Advances in Corpus Applications in Literary and Translation Studies
Editors: Riccardo Moratto, Defeng Li
Place of Publication: London
Publisher: Routledge
Pages: 81-102
ISBN (Electronic): 9781003298328
ISBN (Print): 9781032287386, 9781032287409
Publication status: Published - 2023

Publication series

Name: Routledge Advances in Translation and Interpreting Studies
