Project Details
Description
This proposal aims to create a treebank --- a database of syntactic parses of sentences
in a text corpus --- for the Chinese Buddhist Canon.
It is difficult to apply standard techniques for textual analysis to the Canon: with 52
million characters, divided into 1,514 individual texts, its sheer volume overwhelms the
community of scholars and students who can read it. Furthermore, since the texts
represent translations from Indic languages into Chinese from the 2nd to the 11th
centuries, the Canon has a complexity that is challenging.

We need new techniques to study vast historical texts over time and space --- the
Canon has not only 1000 years of linguistic data but also metadata such as the year and
place of compilation and names of translators --- at a scale that was not feasible before
we had access to digital collections and computational methods. These techniques would
allow scholars to pose questions over the entire corpus, such as who influenced whom,
which concepts survived over time, and what macroscopic, structural patterns the texts
exhibit. Within this larger context, we propose to address two goals.

Our first goal is to create a treebank of syntactically analyzed texts for a subset of the
Buddhist Canon. The past decade has seen many digitization and annotation projects
for historical texts ranging from classical Arabic to Middle English; we will be the first
to build a large-scale treebank of classical Chinese. In this pursuit, scholarship and
pedagogy are intertwined: scholars draw on treebanks for quantitative evidence in their
research, while students use them as reading support and contribute to them as a
learning exercise.

Building treebanks is, however, labor intensive. Our second goal is to complete the
treebank for the rest of the Canon by leveraging recent advances in automatic natural
language parsing. A syntactic parser will be trained on the hand-crafted treebank from
the first goal, and then applied to the remaining texts. The resulting treebank for the
entire Canon will be openly available to the public.

The Buddhist Canon has been an important object of scholarship for centuries. This
work advances the process, already underway, of applying emerging methods from
computational linguistics to the study of historical languages. It has the potential to
complement traditional research and pedagogical methodology with data-driven,
quantitative analyses, and gives scholars and students the opportunity to effectively
work with much larger volumes of texts than ever before.
Project number | 9041849 |
---|---|
Grant type | ECS |
Status | Finished |
Effective start/end date | 1/11/12 → 31/03/16 |