TY - CHAP
T1 - Analyzing who, what, and where in a mediaeval Chinese corpus
T2 - A case study on the Chinese Buddhist Canon
AU - Wong, Tak-sum
AU - Lee, John Sie Yuen
PY - 2023
Y1 - 2023
N2 - Information extraction from historical text is challenging because of the lack of data to train natural language processing tools. This chapter evaluates the utility of in-domain training data for data-driven profiling of characters, verbs, and toponyms and reports a case study on a corpus of Chinese Buddhist text. As is typical for such a corpus, the Chinese Buddhist Canon has few annotated linguistic resources other than lexica of names, places, and domain-specific terms. We apply a lexicon-based approach for named entity recognition and then report an analysis of the “who,” “what,” and “where” of the Canon: who the characters were, what they did, and where they were. Experimental results also show that even a small amount of word segmentation, part-of-speech, and dependency annotation can improve accuracy in named entity recognition and in extraction of character-verb associations.
UR - http://www.scopus.com/inward/record.url?scp=85143675960&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85143675960&origin=recordpage
U2 - 10.4324/9781003298328-6
DO - 10.4324/9781003298328-6
M3 - RGC 12 - Chapter in an edited book (Author)
SN - 9781032287386
SN - 9781032287409
T3 - Routledge Advances in Translation and Interpreting Studies
SP - 81
EP - 102
BT - Advances in Corpus Applications in Literary and Translation Studies
A2 - Moratto, Riccardo
A2 - Li, Defeng
PB - Routledge
CY - London
ER -