Using phrases as features in email classification

Research output: Journal Publications and Reviews (RGC: 21, 22, 62) > 21_Publication in refereed journal > peer-review

33 Scopus Citations

Author(s)

  • Matthew Chang
  • Chung Keung Poon

Detail(s)

Original language: English
Pages (from-to): 1036-1045
Journal / Publication: Journal of Systems and Software
Volume: 82
Issue number: 6
Publication status: Published - Jun 2009

Abstract

In this paper, we report our experience with using phrases as basic features for email classification. We performed an extensive empirical evaluation on our large email collections and tested three text classification algorithms: a naive Bayes classifier and two k-NN classifiers, one using TF-IDF weighting and the other using resemblance. The investigation covers the effect of phrase size, the size of local and global sampling, the neighbourhood size, and various methods for improving classification accuracy. We determined suitable settings for the parameters of each classifier and compared the classifiers at their best settings. Our results show that no classifier dominates the others in terms of classification accuracy. We also made a number of observations on the special characteristics of emails. In particular, we observed that public emails are easier to classify than private ones. © 2009 Elsevier Inc. All rights reserved.
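
The idea of a resemblance-based k-NN classifier over phrase features can be illustrated with a small sketch. This is not the authors' implementation: it assumes that a phrase is a contiguous run of words (a word n-gram) and that resemblance is the Jaccard overlap of two phrase sets, neither of which the abstract specifies. The function names and the toy emails are hypothetical.

```python
from collections import Counter


def phrases(text, size=2):
    """Return the set of contiguous word n-grams of length `size` (assumed phrase definition)."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def resemblance(a, b):
    """Jaccard overlap of two phrase sets (assumed measure); 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_classify(query, labelled, k=3, size=2):
    """Majority vote among the k labelled emails most similar to the query."""
    q = phrases(query, size)
    ranked = sorted(
        labelled,
        key=lambda item: resemblance(q, phrases(item[0], size)),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for illustration only.
training = [
    ("meeting agenda for project review", "work"),
    ("project review meeting notes", "work"),
    ("family dinner on sunday night", "personal"),
]
predicted = knn_classify("agenda for project review", training, k=1)
```

Here `size` plays the role of the phrase size and `k` the neighbourhood size, two of the parameters the abstract says were studied.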

Research Area(s)

  • Document classification, Email, Naive Bayes, Nearest-neighbour, Resemblance

Citation Format(s)

Using phrases as features in email classification. / Chang, Matthew; Poon, Chung Keung.

In: Journal of Systems and Software, Vol. 82, No. 6, 06.2009, p. 1036-1045.
