TY - GEN
T1 - Extending a thesaurus with words from Pan-Chinese sources
AU - Kwong, Oi Yee
AU - Tsou, Benjamin K.
PY - 2008
Y1 - 2008
N2 - In this paper, we work on extending a Chinese thesaurus with words distinctly used in various Chinese communities. The acquisition and classification of such region-specific lexical items is an important step toward the larger goal of constructing a Pan-Chinese lexical resource. In particular, we extend a previous study in three respects: (1) to improve automatic classification by removing duplicated words from the thesaurus, (2) to experiment with classifying words at the subclass level and semantic head level, and (3) to further investigate the possible effects of data heterogeneity between the region-specific words and words in the thesaurus on classification performance. Automatic classification was based on the similarity between a target word and individual categories of words in the thesaurus, measured by the cosine function. Experiments were done on 120 target words from four regions. The automatic classification results were evaluated against a gold standard obtained from human judgements. In general, accuracy reached 80% or more with the top 10 (out of 80+) and top 100 (out of 1,300+) candidates considered at the subclass level and semantic head level, respectively, provided that the appropriate data sources were used. © 2008. Licensed under the Creative Commons.
AB - In this paper, we work on extending a Chinese thesaurus with words distinctly used in various Chinese communities. The acquisition and classification of such region-specific lexical items is an important step toward the larger goal of constructing a Pan-Chinese lexical resource. In particular, we extend a previous study in three respects: (1) to improve automatic classification by removing duplicated words from the thesaurus, (2) to experiment with classifying words at the subclass level and semantic head level, and (3) to further investigate the possible effects of data heterogeneity between the region-specific words and words in the thesaurus on classification performance. Automatic classification was based on the similarity between a target word and individual categories of words in the thesaurus, measured by the cosine function. Experiments were done on 120 target words from four regions. The automatic classification results were evaluated against a gold standard obtained from human judgements. In general, accuracy reached 80% or more with the top 10 (out of 80+) and top 100 (out of 1,300+) candidates considered at the subclass level and semantic head level, respectively, provided that the appropriate data sources were used. © 2008. Licensed under the Creative Commons.
UR - http://www.scopus.com/inward/record.url?scp=80053405420&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-80053405420&origin=recordpage
U2 - 10.3115/1599081.1599139
DO - 10.3115/1599081.1599139
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 9781905593446
VL - 1
SP - 457
EP - 464
BT - Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
T2 - 22nd International Conference on Computational Linguistics, Coling 2008
Y2 - 18 August 2008 through 22 August 2008
ER -