Unsupervised segmentation of Chinese corpus using accessor variety

Research output: Chapters, Conference Papers, Creative and Literary Works › RGC 32 - Refereed conference paper (with host publication) › peer-review

Abstract

The lack of word delimiters such as spaces in Chinese texts makes word segmentation a special issue in Chinese text processing. As the volume of Chinese texts grows rapidly on the Internet, the number of unknown words increases accordingly. However, word segmentation approaches relying solely on existing dictionaries are helpless in handling unknown words. In this paper, we propose a novel unsupervised method to segment large Chinese corpora using contextual information. In particular, the number of characters preceding and following a string, known as the accessors of the string, is used to measure the independence of the string. The greater the independence, the more likely it is that the string is a word. The segmentation problem is then considered an optimization problem to maximize the target function of this number over all word candidates in an utterance. Our purpose here is to explore the best function in terms of segmentation performance. The performance is evaluated with the word token recall measure in addition to word type precision and word type recall. Among the three types of target functions that we have explored, polynomial functions turn out to outperform others. This simple method is effective in unsupervised segmentation of Chinese texts and its performance is highly comparable to other recently reported unsupervised segmentation methods. © Springer-Verlag Berlin Heidelberg 2005.
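The abstract describes the method only in outline. As a rough illustration, the sketch below counts the distinct characters seen on each side of every substring and then picks, by dynamic programming, the segmentation that maximizes a summed score over its words. Several details are assumptions of this sketch rather than facts stated in the abstract: taking the minimum of the left and right counts, treating sentence boundaries as a single marker, the polynomial form `len(w)**c * av**d`, and the names `accessor_variety`, `segment`, `max_len`, `c` and `d`.

```python
from collections import defaultdict

# Sentence-boundary markers (an assumption of this sketch).
START, END = "<s>", "</s>"


def accessor_variety(sentences, max_len=4):
    """Map each substring of up to max_len characters to
    min(distinct left neighbours, distinct right neighbours)."""
    left = defaultdict(set)
    right = defaultdict(set)
    for s in sentences:
        n = len(s)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                w = s[i:j]
                left[w].add(s[i - 1] if i > 0 else START)
                right[w].add(s[j] if j < n else END)
    return {w: min(len(left[w]), len(right[w])) for w in left}


def segment(sentence, av, max_len=4, c=2.0, d=1.0):
    """Split `sentence` so that the sum of len(w)**c * av(w)**d is maximal.

    The polynomial score and the exponents c, d are hypothetical choices;
    substrings never seen in the corpus get an accessor variety of 0.
    """
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[j]: best score for sentence[:j]
    back = [0] * (n + 1)              # back[j]: start index of the last word
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            w = sentence[i:j]
            score = best[i] + (len(w) ** c) * (av.get(w, 0) ** d)
            if score > best[j]:
                best[j], back[j] = score, i
    # Follow the back-pointers from the end to recover the words.
    words = []
    j = n
    while j > 0:
        words.append(sentence[back[j]:j])
        j = back[j]
    return words[::-1]
```

In use, `accessor_variety` would be run once over the raw corpus and `segment` applied to each utterance; changing `c`, `d`, or the score expression is how different target functions could be compared under this sketch.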
Original language: English
Title of host publication: Natural Language Processing – IJCNLP 2004
Subtitle of host publication: First International Joint Conference, Hainan Island, China, March 22-24, 2004, Revised Selected Papers
Editors: Keh-Yih Su, Jun’ichi Tsujii, Jong-Hyeok Lee, Oi Yee Kwong
Place of Publication: Berlin, Heidelberg
Publisher: Springer
Pages: 694-703
ISBN (Electronic): 978-3-540-30211-7
ISBN (Print): 978-3-540-24475-2
DOIs
Publication status: Published - 2005
Event: 1st International Joint Conference on Natural Language Processing (IJCNLP 2004) - Hainan Island, China
Duration: 22 Mar 2004 to 24 Mar 2004

Publication series

Name: Lecture Notes in Artificial Intelligence
Volume: 3248
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 1st International Joint Conference on Natural Language Processing (IJCNLP 2004)
Place: China
City: Hainan Island
Period: 22/03/04 to 24/03/04
