The identification of stop words and keywords: a study of automatic term weighting in natural language text processing
停頓詞與關鍵詞的鑑別 : 關於自然語言文本處理中自動術語加權的研究
Student thesis: Master's Thesis
Award date: 14 Jul 2006
This thesis addresses two important problems related to term weighting in natural language text processing. The first is the identification of stop words in Chinese text, with the goal of automatically constructing a complete Chinese stop word list to save time and remove the burden of manual stop word selection. The second is the identification of keywords, which can be viewed as the opposite of stop word identification. Both problems matter in many text processing fields, such as information retrieval, text categorization, and summarization, because they can greatly affect performance.

Although Chinese is spoken by a large share of the world's population, no methods for identifying Chinese stop words have existed until now, in contrast to the situation for English. The absence of spaces or other word delimiters in Chinese, together with the small variation in word length, makes extracting stop words particularly difficult. In this thesis, we therefore first investigate Chinese word segmentation, an unavoidable step before stop word identification, and propose a unified segmentation algorithm for Chinese that incorporates web mining. Experiments show that this algorithm outperforms traditional segmentation algorithms. Building on this improved understanding of Chinese segmentation, we develop an efficient method for automatically extracting Chinese stop word lists, and in our experiments we use it to construct a complete Chinese stop word list from a large corpus. We also present several novel methodologies for evaluating the effectiveness of the resulting stop word list, including applications of stop words to automatic abstract extraction and word segmentation. Finally, based on this study of Chinese stop word weighting, we examine in depth several important keyword weighting schemes currently used in natural language text processing.
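The abstract does not specify the extraction criteria used in the thesis; as a minimal illustrative sketch only (the function name and the scoring rule are assumptions, not the thesis's method), stop-word candidates can be ranked by combining a term's overall frequency with the fraction of documents it appears in, since stop words tend to be both frequent and evenly spread:

```python
from collections import Counter

def stopword_candidates(docs, top_k=10):
    """Rank terms as stop-word candidates.

    Score = total frequency * (document frequency / number of documents);
    terms that are both frequent and present in most documents score high.
    This is an illustrative heuristic, not the thesis's actual method.
    """
    tf = Counter()   # total occurrences of each term
    df = Counter()   # number of documents containing each term
    for doc in docs:
        tf.update(doc)
        df.update(set(doc))
    n = len(docs)
    scores = {t: tf[t] * (df[t] / n) for t in tf}
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

# Tiny segmented example: the particle "的" is frequent and spread evenly,
# so it surfaces as the top stop-word candidate.
docs = [
    ["的", "我", "喜欢", "的", "书"],
    ["的", "他", "看", "的", "电影"],
    ["我", "的", "朋友"],
]
print(stopword_candidates(docs, top_k=3))
```

In this toy corpus the function ranks "的" first, matching the intuition that high-frequency, evenly distributed terms are stop words.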
Taking into account an information-theoretic factor, entropy, which captures characteristic properties of keyword distributions, we propose a new term weighting scheme based on TF*IDF. Comprehensive comparisons between traditional schemes and ours show that the proposed scheme outperforms the TF*IDF scheme in some circumstances.
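The abstract does not give the exact formula combining entropy with TF*IDF, so the sketch below is an assumption-laden illustration (the product form and all names are hypothetical): it computes a TF*IDF-style weight and scales it by how concentrated the term's occurrences are across documents, on the reasoning that keywords cluster in a few documents while non-keywords spread out.

```python
import math
from collections import Counter

def tfidf_entropy(docs):
    """Illustrative entropy-augmented TF*IDF (not the thesis's formula).

    For each term:
      weight = total_tf * idf * (1 - normalized_entropy)
    where the entropy is taken over the term's distribution across
    documents; low entropy (concentrated occurrence) boosts the weight.
    """
    n = len(docs)
    counts = [Counter(doc) for doc in docs]   # per-document term counts
    df = Counter()
    for c in counts:
        df.update(c.keys())
    weights = {}
    for term in df:
        total = sum(c[term] for c in counts)
        # Entropy of the term's occurrence distribution over documents.
        probs = [c[term] / total for c in counts if c[term] > 0]
        h = -sum(p * math.log(p, 2) for p in probs)
        h_norm = h / math.log(n, 2) if n > 1 else 0.0
        idf = math.log(n / df[term])
        weights[term] = total * idf * (1.0 - h_norm)
    return weights
```

As a design note, the `(1 - normalized_entropy)` factor is one plausible choice: a term appearing many times but only in one document keeps its full TF*IDF weight, while a term spread uniformly across all documents is damped toward zero.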
- Text processing (Computer science), Natural language processing (Computer science)