
Disambiguating Names of Chinese Historical Figures in Local Gazetteers Digitally

Research output: Conference Papers, RGC 32 - Refereed conference paper (without host publication), peer-review

Abstract

When biographical data extracted from more than 2,000 local gazetteers (difangzhi) is integrated into the China Biographical Database (CBDB), records of the same historical person have to be identified and linked. This is the procedure of “disambiguating” them in the datafication of biographical information for prosopographical databases. The data was intended to populate CBDB, a relational database with biographical information on approximately 471k individuals (as of November 2020), which is meant to be useful for statistical, social network, spatial, and other kinds of analysis. Traditional Chinese naming customs pose major challenges to this disambiguation, however, given the number of identical names, especially in a local gazetteer dataset containing 120k records and 90k unique names of government officials from imperial China. In addition, useful variables are missing from numerous entries in those gazetteers. In my conference presentation, I lay out solutions for disambiguating identical personal names in Chinese script. First, individuals who repeatedly took official posts in the same locality are identified digitally and then disambiguated. Second, the overlap of content between different gazetteers is cross-tabulated, and the overlapping entries in those titles are processed accordingly. Finally, the remaining data is corroborated with external datasets, e.g. the China Government Employee Database – Qing (CGED-Q) developed by the Lee-Campbell research group. With these workflows, 51k personal names from premodern China are disambiguated with optimal precision and unprecedented efficiency. Such a task is only possible if done digitally, and it serves as an example of what digital humanities can achieve for research on Chinese history. The techniques explored in this study will also be useful for disambiguation and Named Entity Recognition of other large-scale data in non-Latin scripts.
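The first step of the workflow (linking records of officials who held posts repeatedly in the same locality) might be sketched as follows. The abstract does not state the matching criteria, so this is only an illustration under assumptions: the `Record` fields, the `link_same_locality` function, and the `max_gap` threshold of 30 years are all hypothetical and are not taken from the study.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One gazetteer entry. All field names are hypothetical."""
    record_id: int
    name: str       # personal name in Chinese script
    locality: str   # place where the post was held
    year: int       # year the post was taken up (assumed to be known)


def link_same_locality(records, max_gap=30):
    """Cluster records that share a name and a locality.

    Within each (name, locality) bucket, records are sorted by year and
    chained together while consecutive years differ by at most max_gap.
    max_gap is an assumed plausibility window, not a value from the study.
    Returns a list of clusters, each a list of record ids.
    """
    buckets = defaultdict(list)
    for rec in records:
        buckets[(rec.name, rec.locality)].append(rec)

    clusters = []
    for group in buckets.values():
        group.sort(key=lambda rec: rec.year)
        current = [group[0]]
        for rec in group[1:]:
            if rec.year - current[-1].year <= max_gap:
                current.append(rec)  # plausibly the same person
            else:
                clusters.append([r.record_id for r in current])
                current = [rec]      # too far apart: start a new person
        clusters.append([r.record_id for r in current])
    return clusters


# Invented sample data, for illustration only.
sample = [
    Record(1, "王安", "吳縣", 1700),
    Record(2, "王安", "吳縣", 1705),
    Record(3, "王安", "吳縣", 1800),
    Record(4, "王安", "長洲", 1702),
]
clusters = link_same_locality(sample)
```

Here records 1 and 2 would be linked as one person, while record 3 (a century later) and record 4 (a different locality) stay separate and would be left for the later steps, such as cross-gazetteer comparison or corroboration against an external dataset.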
Original language: English
Pages: 65
Publication status: Published - May 2021
Event: International Conference on Digital Representation and Research in Art, Humanities and Culture (DH 2020) - Hang Seng University of Hong Kong or Online, Hong Kong, China
Duration: 6 May 2021 to 7 May 2021
https://dh2020.hsu.edu.hk/

Conference

Conference: International Conference on Digital Representation and Research in Art, Humanities and Culture (DH 2020)
Abbreviated title: DH2020
Place: Hong Kong, China
Period: 6/05/21 to 7/05/21