The Application of Fine-grained Syntactic Features to Automatic Genre Classification


Student thesis: Doctoral Thesis




Awarding Institution
Award date: 29 Aug 2017


Automatic Genre Classification (AGC) is a sub-task of Automatic Text Classification (ATC) that aims to label texts with pre-defined genre categories automatically by applying Machine Learning (ML) algorithms. Since the 1990s, when computers became prevalent, ATC has attracted growing interest across a wide range of research areas (e.g. Natural Language Processing, Information Retrieval, Computational Linguistics) because of its usefulness in applications such as spam filtering, Sentiment Analysis and Authorship Attribution.
Related studies in ATC generally fall into two main streams: one is led by computer scientists who strive for more advanced ML algorithms; the other is led by linguists who are more interested in unearthing linguistic features with discriminative power for ATC. With respect to the first stream, over the past two decades the state-of-the-art ML algorithms, such as Naïve Bayes (NB), Support Vector Machines (SVM), Decision Trees (DT), K-Nearest Neighbours (KNN), and Neural Networks (NNs) for Deep Learning (DL), have matured to the point of reaching something of a plateau.
As for the second stream, that is, the employment of potential linguistic representations for ATC, there is far more room for progress. Past studies have extensively applied lexical and grammatical features to ATC. Very few studies have used syntactic features, and none have made use of internal structures, i.e. the exhaustive branching structures of constituents, which encode rich linguistic interpretations of structural order, syntactic category and syntactic function. This is probably due to the lack of a specialised program for retrieving the desired structures. This research develops a novel retrieval method based on the Finite State Machine (FSM) and is hence able to identify, extract and count the structural representations when preparing the syntactic features for the classifiers.
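The abstract does not reproduce the retrieval method itself. As a rough sketch of the idea only, the fragment below (the labels, the pattern and the transition table are all hypothetical, not the thesis's actual ICE-GB scheme) shows how a small finite state machine can recognise one internal branching pattern, here a noun phrase realised as a determiner, optional adjective phrases and a head noun, in the sequence of a constituent's daughter labels:

```python
# Hypothetical sketch: matching a constituent's internal structure with a
# finite state machine. The labels and transition table are illustrative
# inventions, not the thesis's actual annotation scheme.

# Transition table: state -> {daughter label -> next state}.
# Pattern recognised: a determiner, any number of adjective phrases,
# then a noun-phrase head, i.e. DT (AJP)* NPHD.
TRANSITIONS = {
    "start":     {"DT": "after_det"},
    "after_det": {"AJP": "after_det", "NPHD": "accept"},
}

def matches(daughters, transitions=TRANSITIONS, start="start", accept="accept"):
    """Run the FSM over a constituent's sequence of daughter labels."""
    state = start
    for label in daughters:
        state = transitions.get(state, {}).get(label)
        if state is None:        # no transition: the pattern fails
            return False
    return state == accept

def count_pattern(constituents):
    """Count NP constituents whose daughters match the pattern; such a
    count would become one dimension of a structural feature vector."""
    return sum(1 for cat, daughters in constituents
               if cat == "NP" and matches(daughters))

constituents = [("NP", ["DT", "NPHD"]),         # e.g. "the dog"
                ("NP", ["DT", "AJP", "NPHD"]),  # e.g. "the big dog"
                ("NP", ["NPHD"]),               # no determiner: no match
                ("VP", ["MVB"])]                # verb phrase: ignored
print(count_pattern(constituents))  # 2
```

In this style, each branching pattern of interest gets its own small automaton, and the match counts over a text's parse trees form its structural feature vector.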
Moreover, there is a long-standing controversy among existing studies: some researchers hold that simple lexical features are significantly more discriminative for ATC than complex features (e.g. Multi-word Expressions, POS n-grams, semantic relations), while others believe the opposite. It is more reasonable for a linguist to expect that richer linguistic interpretations would provide more clues for a classifier to recognise a text category. The main problem lies in how well and reliably an annotation scheme can encode the linguistic information in a computational sense. With respect to this concern, the ICE-GB corpus is adopted as the training and testing data because it is regarded as the most sophisticatedly annotated corpus, parsed by the Survey Parser with manual validation. Additionally, it accommodates 83,394 parse trees, a number unparalleled by any other existing English corpus. This study is hence able to make use of a comprehensive set of fine-grained syntactic features for AGC. Subsets of the structural representations are also tested individually in order to discover their discriminative power in distinguishing genre categories. Bags of Words (BOWs) are used as the baseline features. Naïve Bayes (NB) and Sequential Minimal Optimization (SMO) are adopted for building the learning models.
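As a minimal sketch of this kind of experimental setup (the thesis works with ICE-GB features and standard NB/SMO implementations; the toy genres, texts and code below are purely illustrative assumptions), a multinomial Naïve Bayes classifier over bag-of-words counts might look like:

```python
# Hypothetical sketch of multinomial Naive Bayes over bag-of-words
# features with add-one smoothing; the toy "genres" and token lists
# are invented for illustration only.
from collections import Counter, defaultdict
import math

def train(docs):
    """docs: list of (genre, tokens). Returns, per genre, the log prior,
    smoothed per-token log-probabilities, and an unknown-token log-prob."""
    class_counts = Counter(g for g, _ in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for genre, tokens in docs:
        token_counts[genre].update(tokens)
        vocab.update(tokens)
    total = sum(class_counts.values())
    model = {}
    for genre in class_counts:
        n = sum(token_counts[genre].values())
        log_prior = math.log(class_counts[genre] / total)
        log_prob = {t: math.log((token_counts[genre][t] + 1) / (n + len(vocab)))
                    for t in vocab}
        model[genre] = (log_prior, log_prob, math.log(1 / (n + len(vocab))))
    return model

def classify(model, tokens):
    """Pick the genre with the highest posterior log-score."""
    def score(genre):
        log_prior, log_prob, unk = model[genre]
        return log_prior + sum(log_prob.get(t, unk) for t in tokens)
    return max(model, key=score)

docs = [("dialogue", ["yeah", "you", "know", "right"]),
        ("dialogue", ["you", "see", "yeah"]),
        ("academic", ["results", "analysis", "data"]),
        ("academic", ["data", "method", "results"])]
model = train(docs)
print(classify(model, ["yeah", "you"]))      # dialogue
print(classify(model, ["data", "results"]))  # academic
```

The same pipeline applies unchanged when the token counts are replaced by counts of syntactic structures, which is what makes the lexical-versus-structural comparison straightforward.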
The classification experiments have produced very promising results for the use of fine-grained syntactic interpretations in AGC, and have thus answered the controversy: complex features do help predict text categories more accurately. In addition, the experimental results and statistical tests have revealed differences in discriminative power among the feature sets: structural features alone are not as discriminative as lexical features for AGC, but the gap is not significant when a proper ML model is chosen; phrasal structures are significantly more discriminative than clausal structures; main clauses are significantly more discriminative than dependent clauses, while no significant difference is found between subordinate clauses and embedded clauses; noun phrases are the most discriminative of the seven phrasal structure types, with a significant margin. This study has also identified several genre classes that correlate positively with the use of syntactic structures for AGC. For instance, the structural features consistently outperform lexical features in dialogues and public speeches, which involve a high degree of interaction between participants, and in unscripted speeches, spontaneous commentaries and timed student scripts, which are all produced under time pressure. The contextual factors (e.g. interaction, time pressure and setting) are believed to account for the addresser/writer's employment of such distinctive structural expressions and hence to contribute to their discriminativeness. Moreover, syntactic features correlate positively with the SMO model, while lexical features correlate positively with the NB model. The feature properties (i.e. feature dimensionality, data sparseness and degree of independence) as well as the machine learning algorithms can account for these correlations.
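The abstract does not name the statistical tests used. One common choice for comparing two classifiers, say one trained on structural features and one on lexical features, over the same test items is McNemar's test; the sketch below (illustrative predictions, not the thesis's actual results) shows the idea:

```python
# Hypothetical sketch: McNemar's test (with continuity correction) for
# comparing two classifiers on the same test items. The abstract does not
# specify which test the thesis actually applied; this is one common option.

def mcnemar(gold, pred_a, pred_b):
    """Chi-squared statistic over the items where exactly one of the two
    classifiers is correct (the discordant pairs)."""
    b = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a == g and p != g)
    c = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a != g and p == g)
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# Invented gold labels and predictions: A is correct on items 0-10,
# B is correct on items 8-11 only.
gold   = [1] * 12
pred_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
pred_b = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(mcnemar(gold, pred_a, pred_b))  # 4.0, above the chi-squared(1)
                                      # 5% critical value of 3.84
```

Because the statistic exceeds 3.84 on this invented data, the difference between the two classifiers would be judged significant at the 5% level.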