Abstract
The current study investigates how terminologically informed features contribute to automatic text classification. In particular, we examine the use of terms and term-related units as feature sets in different classification tasks. A sub-corpus of 80 texts was drawn from the British component of the International Corpus of English. Three classification tasks were defined, according to subject domain, register and text category. The performance of the selected feature sets was evaluated in terms of F-score using machine learning techniques, and compared with that of conventional lexical and grammatical feature sets. Although the corpus is comparatively small, the empirical results show that, while features determined according to the lexical criterion perform consistently, the use of terms produces superior performance when texts are classified according to subject domain.
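For readers unfamiliar with the evaluation measure, the following is a minimal sketch of how an F-score can be computed from gold and predicted labels. It is not the authors' implementation: the function names, the macro-averaging step, the default `beta=1.0` and the example labels are all assumptions made for illustration.

```python
def f_score(gold, predicted, label, beta=1.0):
    """F-score for a single class, given parallel lists of gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # true positives
    fp = sum(1 for g, p in pairs if g != label and p == label)  # false positives
    fn = sum(1 for g, p in pairs if g == label and p != label)  # false negatives
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def macro_f_score(gold, predicted):
    """Unweighted mean of the per-class F-scores over the classes present in gold."""
    labels = sorted(set(gold))
    return sum(f_score(gold, predicted, label) for label in labels) / len(labels)


# Hypothetical labels for four texts in a two-domain classification task.
gold = ["science", "science", "arts", "arts"]
predicted = ["science", "arts", "arts", "arts"]
print(round(macro_f_score(gold, predicted), 4))
```

Macro-averaging weights every class equally, which is one reasonable choice for a small corpus; the abstract does not state which averaging scheme was used.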
| Original language | English |
|---|---|
| Journal | CEUR Workshop Proceedings |
| Volume | 673 |
| Publication status | Published - 2010 |
| Event | EKAW 2010 Workshop 6: "Reuse and Adaptation of Ontologies and Terminologies 2010", EKAW-WS6 2010 - Lisbon, Portugal, 15 Oct 2010 |
Research Keywords
- Automatic text classification
- Feature generation
- Machine learning
- Terms