Unsupervised classification of biomedical abstracts using lexical association

Research output: Chapters, Conference Papers, Creative and Literary Works (RGC: 12, 32, 41, 45); 32_Refereed conference paper (with ISBN/ISSN); peer-review


Author(s)

Jonathon Read, Jonathan Webster, Alex Chengyu Fang

Detail(s)

Original language: English
Title of host publication: PACLIC 24 - Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation
Pages: 261-270
Publication status: Published - 2010

Conference

Title: 24th Pacific Asia Conference on Language, Information and Computation, PACLIC 24
Place: Japan
City: Sendai
Period: 4 - 7 November 2010

Abstract

The task of text classification is the assignment of labels that describe texts' characteristics, such as topic, genre or sentiment. Supervised machine learning techniques such as Support Vector Machines or the simple but effective Naïve Bayes have been successfully applied to this task. However, it is not always practical to acquire a sufficient corpus of labelled examples to train these methods. For such cases, we describe an unsupervised method for text classification based on two hypotheses. Firstly, we propose that the class of a document may be determined by calculating its constituent features' similarity with prototypical examples of each class. Secondly, we note the importance of class priors in Naïve Bayes classifiers, and hypothesize that class distributions might be estimated using the relative frequency of prototype words. Performing experiments on a corpus of biomedical abstracts with topic information derived from the Medical Subject Headings (MeSH), we investigate the characteristics of the method when used in conjunction with basic, linguistic and knowledge-based features, and find that the performance of the unsupervised method is approximately 80% of that of Naïve Bayes. Our research is significant in that it highlights a candidate method with good potential for further improvement when training on unlabelled data. © 2010 by Jonathon Read, Jonathan Webster, and Alex Chengyu Fang.
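The two hypotheses in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the document-level pointwise mutual information, the add-one smoothing of the priors, the zero score for unseen word pairs, and the way the log prior is added to the association score are all assumptions made here. The function names, the toy corpus and the prototype words are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def count_cooccurrences(docs):
    """Collect document frequencies of words and word pairs from unlabelled docs."""
    word_df, pair_df = Counter(), Counter()
    for doc in docs:
        words = sorted(set(doc))
        word_df.update(words)
        pair_df.update(frozenset(pair) for pair in combinations(words, 2))
    return len(docs), word_df, pair_df


def pmi(a, b, n_docs, word_df, pair_df):
    """Document-level PMI: log2(P(a, b) / (P(a) * P(b)))."""
    joint = word_df[a] if a == b else pair_df[frozenset((a, b))]
    if joint == 0:
        return 0.0  # assumption: unseen pairs contribute nothing (not -inf)
    return math.log2(joint * n_docs / (word_df[a] * word_df[b]))


def estimate_priors(prototypes, word_df):
    """Estimate class priors from the relative frequency of prototype words."""
    # Assumption: add-one smoothing keeps every prior non-zero.
    mass = {c: 1 + sum(word_df[p] for p in protos)
            for c, protos in prototypes.items()}
    total = sum(mass.values())
    return {c: m / total for c, m in mass.items()}


def classify(doc, prototypes, stats):
    """Pick the class whose prototype words associate most strongly with doc."""
    n_docs, word_df, pair_df = stats
    priors = estimate_priors(prototypes, word_df)
    scores = {}
    for label, protos in prototypes.items():
        # Sum PMI over all (word, prototype) pairs, averaged over prototypes.
        assoc = sum(pmi(w, p, n_docs, word_df, pair_df)
                    for w in doc for p in protos) / len(protos)
        # Assumption: combine log prior and association by simple addition.
        scores[label] = math.log2(priors[label]) + assoc
    return max(scores, key=scores.get)


# Hypothetical toy corpus of tokenised, unlabelled abstracts.
DOCS = [
    ["heart", "cardiac", "artery"],
    ["heart", "artery", "blood"],
    ["cardiac", "heart", "valve"],
    ["tumor", "cancer", "chemotherapy"],
    ["cancer", "tumor", "biopsy"],
    ["tumor", "chemotherapy", "biopsy"],
]
# Hypothetical prototype words, one list per class.
PROTOTYPES = {"cardiology": ["heart"], "oncology": ["tumor"]}

if __name__ == "__main__":
    stats = count_cooccurrences(DOCS)
    print(classify(["artery", "valve"], PROTOTYPES, stats))
```

No labelled documents are used at any point: the only supervision is the short list of prototype words per class, which serves both to score a document and to estimate the class distribution.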

Research Area(s)

  • Pointwise mutual information, Text classification, Unsupervised methods

Citation Format(s)

Unsupervised classification of biomedical abstracts using lexical association. / Read, Jonathon; Webster, Jonathan; Fang, Alex Chengyu.

PACLIC 24 - Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation. 2010. p. 261-270.
