Use of terms and term-related units as feature sets for automatic text classification

    Research output: Journal Publications and Reviews; RGC 22 - Publication in policy or professional journal

    Abstract

    The current study investigates how terminologically informed features contribute to automatic text classification. In particular, we examine the use of terms and term-related units as feature sets in different classification tasks. A sub-corpus of 80 texts was created from the British component of the International Corpus of English. Three classification tasks were defined, according to subject domain, register and text category. The performance of the selected feature sets was evaluated in terms of F-score using machine learning techniques, and compared with that of conventional lexical and grammatical feature sets. Although the corpus is comparatively small, the empirical results show that features determined according to the lexical criterion perform consistently across tasks, while the use of terms produces superior performance when texts are classified according to subject domain.
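
    As an illustration of the evaluation measure named in the abstract, the sketch below computes per-class and macro-averaged F-scores from gold and predicted labels. It is a minimal example under stated assumptions, not the study's actual pipeline: the function name `f_scores`, the balanced F1 variant, the macro averaging and the example labels are all hypothetical, since the abstract does not specify them.

```python
from collections import Counter


def f_scores(gold, predicted):
    """Return (per-class F1 dict, macro-averaged F1) for two label lists.

    Assumption: balanced F1 (harmonic mean of precision and recall),
    macro-averaged over every label seen in either list.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed

    per_class = {}
    for label in sorted(set(gold) | set(predicted)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class[label] = 2 * tp[label] / denom if denom else 0.0

    macro = sum(per_class.values()) / len(per_class) if per_class else 0.0
    return per_class, macro


# Hypothetical labels for a subject-domain task (not taken from the study).
gold = ["science", "science", "humanities", "humanities"]
pred = ["science", "humanities", "humanities", "humanities"]
per_class, macro = f_scores(gold, pred)
```

    Running the same function over the predictions obtained with each feature set (terms, lexical, grammatical) would give directly comparable scores per task.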
    Original language: English
    Journal: CEUR Workshop Proceedings
    Volume: 673
    Publication status: Published - 2010
    Event: EKAW 2010 Workshop 6: "Reuse and Adaptation of Ontologies and Terminologies 2010", EKAW-WS6 2010 - Lisbon, Portugal
    Duration: 15 Oct 2010 to 15 Oct 2010

    Research Keywords

    • Automatic text classification
    • Feature generation
    • Machine learning
    • Terms
