Corpus-based topic diffusion for short text clustering

Research output: Research (peer-reviewed), 21_Publication in refereed journal

Detail(s)

Original language: English
Pages (from-to): 2444-2458
Journal / Publication: Neurocomputing
Volume: 275
Early online date: 16 Nov 2017
State: Published - 31 Jan 2018

Abstract

In this paper, we propose a novel corpus-based enrichment approach for short text clustering. Because sparseness leads to insufficient word co-occurrence and a lack of context information, previous studies use external sources such as Wikipedia or WordNet to enrich the representation of short text documents, which requires extra resources and may introduce inconsistencies. Corpus-based approaches, by contrast, use no external information in mining short text data. By introducing a set of conjugate definitions to characterize the structures of topics and words, and by proposing a virtual generative procedure for short texts, we perform expansion on short text data. Specifically, new words that may not appear in a short text document are added with a virtual term frequency, which is obtained from the posterior probabilities of the new words given all the words in that document. The complete procedure can be regarded as mapping data points (documents) from the original feature space to a hidden semantic space (topic space). After semantic smoothing is performed, the data points are mapped back to the original space. We conduct experiments on two short text datasets, and the results show that the proposed method can effectively address the sparseness problem. On these datasets, our method, using only a basic clustering algorithm, attains performance comparable to that of methods based on enrichment with external information sources.
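
The mapping described in the abstract (documents to topic space, then back to word space with virtual term frequencies) can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not give the conjugate definitions or the generative procedure, so the sketch assumes a topic-word probability matrix is already available and swaps in a plain naive Bayes posterior over topics. The function name `enrich`, the `weight` parameter, the toy vocabulary, and the toy probabilities are all hypothetical.

```python
import numpy as np

def enrich(counts, topic_word, topic_prior=None, weight=1.0):
    """Add virtual term frequencies to sparse documents (illustrative only).

    counts:      (n_docs, n_words) raw term-frequency matrix
    topic_word:  (n_topics, n_words) matrix of p(word | topic), strictly positive
    topic_prior: (n_topics,) prior p(topic); uniform if omitted (an assumption)
    weight:      hypothetical scale applied to the virtual frequencies
    """
    counts = np.asarray(counts, dtype=float)
    topic_word = np.asarray(topic_word, dtype=float)
    n_topics = topic_word.shape[0]
    if topic_prior is None:
        topic_prior = np.full(n_topics, 1.0 / n_topics)

    # Step 1: map each document to topic space.
    # log p(topic | doc) = log p(topic) + sum_w count_w * log p(w | topic) + const
    log_post = np.log(topic_prior) + counts @ np.log(topic_word).T
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # Step 2: map back to word space.
    # p(word | doc) = sum_topic p(word | topic) * p(topic | doc)
    word_given_doc = post @ topic_word

    # Step 3: keep observed counts; give unseen words a virtual frequency.
    enriched = np.where(counts > 0, counts, weight * word_given_doc)
    return enriched, post

# Toy vocabulary (hypothetical): goal, match, vote, election
topic_word = np.array([
    [0.45, 0.45, 0.05, 0.05],  # a "sports"-like topic
    [0.05, 0.05, 0.45, 0.45],  # a "politics"-like topic
])
counts = np.array([
    [1, 0, 0, 0],  # document containing only "goal"
    [0, 0, 1, 0],  # document containing only "vote"
])
enriched, post = enrich(counts, topic_word)
print(np.round(enriched, 2))
```

In this toy case the document containing only "goal" receives a larger virtual frequency for "match" than for "vote" or "election", which is the kind of smoothing the abstract describes. A standard clustering algorithm could then be run on the enriched matrix instead of the raw counts.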

Research Area(s)

  • Clustering, Short text, Text enrichment, Text mining