Mining language variation using word using and collocation characteristics

Research output: Journal Publications and Reviews (RGC: 21, 22, 62) > 21_Publication in refereed journal > peer-review

3 Scopus Citations
Detail(s)

Original language: English
Pages (from-to): 7805-7819
Journal / Publication: Expert Systems with Applications
Volume: 41
Issue number: 17
Publication status: Published - 1 Dec 2014

Abstract

Two textual metrics, "Frequency Rank" (FR) and "Intimacy", are proposed in this paper to measure word usage and collocation characteristics, two important aspects of text style. The FR, derived from the local index numbers of the terms in a sentence when ordered by the global frequency of those terms, provides single-term-level information. The Intimacy models the relationship between a word and others, i.e. how close a term is to the other terms in the same sentence. Two textual features for capturing language variation, "Frequency Rank Ratio" (FRR) and "Overall Intimacy" (OI), are derived from the two proposed metrics. Using the derived features, language variation among documents can be visualized in a text space. Three corpora consisting of documents of diverse topics, genres, regions, and dates of writing are designed and collected to evaluate the proposed algorithms. Extensive simulations are conducted to verify the feasibility and performance of our implementation. Both theoretical analyses based on entropy and the simulations demonstrate the feasibility of our method. We also show that the proposed algorithm can be used to visualize the closeness of several Western languages. Variation of modern English over time is also recognizable with our analysis method. Finally, our method is compared with conventional text classification implementations. The comparative results indicate that our method outperforms the others. © 2014 Elsevier Ltd. All rights reserved.
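
The abstract describes FR only in words, so the following is a minimal sketch of one possible reading, not the paper's implementation. It assumes that a term's FR is its 1-based position among the distinct terms of a sentence once they are sorted by descending corpus-wide frequency. The function names, the alphabetical tie-breaking rule, and the toy corpus are all hypothetical and used for illustration only.

```python
from collections import Counter


def global_frequencies(sentences):
    """Count how often each term occurs across the whole corpus."""
    return Counter(term for sentence in sentences for term in sentence)


def frequency_ranks(sentence, freq):
    """Map each distinct term of a sentence to its 1-based local index
    after ordering the terms by descending global frequency.

    Ties are broken alphabetically (an assumption made here so the
    result is deterministic).
    """
    ordered = sorted(set(sentence), key=lambda t: (-freq[t], t))
    return {term: i for i, term in enumerate(ordered, start=1)}


# Toy corpus: each sentence is already tokenized into lowercase terms.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["a", "cat", "and", "a", "dog"],
]

freq = global_frequencies(corpus)
ranks = frequency_ranks(corpus[1], freq)
print(ranks)
```

In this toy corpus "the" is the most frequent term overall, so it receives rank 1 in the second sentence. The abstract does not specify how FRR and OI are derived from such ranks, so they are not sketched here.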

Research Area(s)

  • Frequency Rank Ratio, Language variation, Overall Intimacy, Text mining