Relating fine-grained syntactic categories and functions to genre variation and automatic genre classification

Research output: Chapters, Conference Papers, Creative and Literary Works (RGC: 12, 32, 41, 45); 32_Refereed conference paper (with ISBN/ISSN); peer-review

Detail(s)

Original language: English
Title of host publication: APCLC3
Publication status: Published - 21 Oct 2016

Conference

Title: The 3rd Asia Pacific Corpus Linguistics Conference (APCLC 2016)
Location: Beihang University
Place: China
City: Beijing
Period: 21 - 23 October 2016

Abstract

Early corpus-based studies on genre variation and classification have yielded many interesting empirical observations. Biber (1988), for example, studied the similarities and differences among genres by focusing on 67 linguistic features; Moschitti & Basili (2004) investigated POS tags, complex nominals, proper nouns and word senses for Text Classification (TC); Zhang et al. (2008) used multi-word expressions as document representations for TC. These studies have all demonstrated the value of linguistic features in discriminating texts of different genres. Nevertheless, the intensively investigated linguistic features mainly involve the lexical and grammatical levels. Syntactic and higher-level linguistic features (e.g. at the semantic and pragmatic levels) have not been exploited as much, partly because of the lack of sophisticated annotation schemes, and partly because of the notable computational difficulty of processing natural texts at such levels. In this paper, apart from fine-grained grammatical features, syntactic features are studied in particular, using the ICE Corpus, which was syntactically parsed by the Survey Parser (Fang, 1996). According to Fang & Cao (2015), fine-grained annotation can provide more useful linguistic features for high-performance NLP systems and enhance their predictive power in Automatic Genre Classification (AGC). The ICE Corpus has been regarded as the most sophisticatedly annotated corpus, and it provides sufficient linguistic resources for the present study. In each parsing unit, besides grammatical information, both syntactic categories and syntactic functions are annotated. From these annotations, three kinds of features are studied: category-function combinations, n-gram function combinations, and grammar-category-function combinations.
Importantly, to test the effectiveness of syntactic features for genre classification, machine learning algorithms are adopted in AGC tasks, with the features fed to state-of-the-art classifiers. The Natural Language Toolkit (Bird et al., 2009) and Weka (Hall et al., 2009) are used as the language processing tools.
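
The three feature types named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `extract_features`, the feature-key prefixes (`CF=`, `GCF=`, `FN=`), the node representation, and the labels in the sample unit are all assumptions made for illustration, and the record does not say how the features were actually encoded or which n-gram order was used.

```python
from collections import Counter

def extract_features(nodes, n=2):
    """Count illustrative syntactic features for one parsing unit.

    `nodes` is a list of (grammatical tag, syntactic category,
    syntactic function) triples. This layout is a hypothetical
    stand-in for the parsed annotation, not the real corpus format.
    """
    feats = Counter()
    for gram, cat, func in nodes:
        # Category-function combination
        feats[f"CF={cat}|{func}"] += 1
        # Grammar-category-function combination
        feats[f"GCF={gram}|{cat}|{func}"] += 1
    # N-gram function combination over the sequence of function labels
    functions = [func for _, _, func in nodes]
    for i in range(len(functions) - n + 1):
        feats["FN=" + ">".join(functions[i:i + n])] += 1
    return feats

# Placeholder labels for a three-node unit (assumed, for illustration only)
unit = [("PRON", "NP", "SU"), ("V", "VP", "VB"), ("N", "NP", "OD")]
features = extract_features(unit)
```

Counts of this kind, aggregated per text, could then serve as the attribute vectors passed to a classifier; that last step is left out here because the abstract does not specify the classifiers or their settings.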

Bibliographic Note

Information for this record is provided by the author(s) concerned.