TY - GEN
T1 - Unsupervised segmentation of Chinese corpus using accessor variety
AU - Feng, Haodi
AU - Chen, Kang
AU - Kit, Chunyu
AU - Deng, Xiaotie
PY - 2005
Y1 - 2005
N2 - The lack of word delimiters such as spaces in Chinese texts makes word segmentation a special issue in Chinese text processing. As the volume of Chinese texts grows rapidly on the Internet, the number of unknown words increases accordingly. However, word segmentation approaches relying solely on existing dictionaries are helpless in handling unknown words. In this paper, we propose a novel unsupervised method to segment large Chinese corpora using contextual information. In particular, the number of characters preceding and following a string, known as the accessors of the string, is used to measure the independence of the string. The greater the independence, the more likely it is that the string is a word. The segmentation problem is then considered an optimization problem to maximize the target function of this number over all word candidates in an utterance. Our purpose here is to explore the best function in terms of segmentation performance. The performance is evaluated with the word token recall measure in addition to word type precision and word type recall. Among the three types of target functions that we have explored, polynomial functions turn out to outperform others. This simple method is effective in unsupervised segmentation of Chinese texts and its performance is highly comparable to other recently reported unsupervised segmentation methods. © Springer-Verlag Berlin Heidelberg 2005.
UR - https://www.scopus.com/pages/publications/26444614686
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-26444614686&origin=recordpage
U2 - 10.1007/978-3-540-30211-7_73
DO - 10.1007/978-3-540-30211-7_73
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 978-3-540-24475-2
T3 - Lecture Notes in Artificial Intelligence
SP - 694
EP - 703
BT - Natural Language Processing – IJCNLP 2004
A2 - Su, Keh-Yih
A2 - Tsujii, Jun’ichi
A2 - Lee, Jong-Hyeok
A2 - Kwong, Oi Yee
PB - Springer
CY - Berlin, Heidelberg
T2 - 1st International Joint Conference on Natural Language Processing (IJCNLP 2004)
Y2 - 22 March 2004 through 24 March 2004
ER -