Extending a thesaurus with words from Pan-Chinese sources

Oi Yee Kwong, Benjamin K. Tsou

    Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

    Abstract

    In this paper, we work on extending a Chinese thesaurus with words distinctly used in various Chinese communities. The acquisition and classification of such region-specific lexical items is an important step toward the larger goal of constructing a Pan-Chinese lexical resource. In particular, we extend a previous study in three respects: (1) to improve automatic classification by removing duplicated words from the thesaurus, (2) to experiment with classifying words at the subclass level and semantic head level, and (3) to further investigate the possible effects of data heterogeneity between the region-specific words and words in the thesaurus on classification performance. Automatic classification was based on the similarity between a target word and individual categories of words in the thesaurus, measured by the cosine function. Experiments were done on 120 target words from four regions. The automatic classification results were evaluated against a gold standard obtained from human judgements. In general, accuracy reached 80% or more when the top 10 (out of 80+) candidates were considered at the subclass level and the top 100 (out of 1,300+) candidates at the semantic head level, provided that the appropriate data sources were used. © 2008. Licensed under the Creative Commons.
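
    The ranking and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature vectors, the category names, the toy words, and the assumption of a single gold category per target word are all hypothetical, chosen only to show how cosine similarity can rank thesaurus categories and how accuracy within the top k candidates might be computed.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_categories(target_vec, category_vecs):
    """Return category names sorted from most to least similar to the target."""
    scored = [(cosine(target_vec, vec), name) for name, vec in category_vecs.items()]
    # Sort by descending similarity; break ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


def top_k_accuracy(rankings, gold, k):
    """Fraction of target words whose gold category is among the top k candidates."""
    hits = sum(1 for word, ranked in rankings.items() if gold[word] in ranked[:k])
    return hits / len(rankings)


# Hypothetical toy data: each category is a bag of context features
# aggregated over the words it contains (an assumption, not from the paper).
category_vecs = {
    "vehicle": Counter({"drive": 3, "road": 2, "wheel": 1}),
    "food": Counter({"eat": 3, "cook": 2, "taste": 1}),
    "finance": Counter({"bank": 3, "pay": 2, "road": 1}),
}
targets = {
    "taxi_term": Counter({"drive": 2, "road": 1, "pay": 1}),
    "snack_term": Counter({"eat": 1, "taste": 2}),
}
gold = {"taxi_term": "vehicle", "snack_term": "food"}  # assumed single gold label

rankings = {word: rank_categories(vec, category_vecs) for word, vec in targets.items()}
print(rankings)
print(top_k_accuracy(rankings, gold, k=1))
```

    In this sketch a target word counts as correctly classified at k when its gold category appears anywhere in the first k ranked candidates, which mirrors the "top 10" and "top 100" reporting in the abstract.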
    Original language: English
    Title of host publication: Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
    Pages: 457-464
    Volume: 1
    Publication status: Published - 2008
    Event: 22nd International Conference on Computational Linguistics, Coling 2008 - Manchester, United Kingdom
    Duration: 18 Aug 2008 to 22 Aug 2008

    Publication series

    Name
    Volume: 1

    Conference

    Conference: 22nd International Conference on Computational Linguistics, Coling 2008
    Place: United Kingdom
    City: Manchester
    Period: 18/08/08 to 22/08/08
