
Corpus-based topic diffusion for short text clustering

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

Abstract

In this paper, we propose a novel corpus-based enrichment approach for short text clustering. Because sparseness causes insufficient word co-occurrence and a lack of context information, previous research has used external sources such as Wikipedia or WordNet to enrich the representation of short text documents, which requires extra resources and may introduce inconsistency. Corpus-based approaches, by contrast, use no external information in mining short text data. By introducing a set of conjugate definitions to characterize the structures of topics and words, and by proposing a virtual generative procedure for short texts, we perform expansion on short text data. Specifically, new words that may not appear in a short text document are added with a virtual term frequency, which is obtained from the posterior probabilities of the new words given all the words in that document. The complete procedure can be regarded as mapping data points (documents) from the original feature space to a hidden semantic space (topic space). After semantic smoothing is performed, the data points are mapped back to the original space. We conduct experiments on two short text datasets, and the results show that the proposed method can effectively address the sparseness problem. On these datasets, our method, using only a basic clustering algorithm, attains performance comparable to that of methods based on enrichment with external information sources.
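The enrichment step described in the abstract can be illustrated with a minimal sketch. This is not the authors' formulation: the abstract does not give the conjugate definitions or the generative procedure, so the sketch assumes that topic-word probabilities and topic priors are already available, that words are independent given a topic, and that the virtual frequency is the document length times the posterior word probability, scaled by a hypothetical `strength` parameter. All function and variable names are illustrative.

```python
import math

def topic_posterior(counts, p_word_given_topic, p_topic):
    """P(topic | document), assuming words are independent given the topic.

    Assumption: all probabilities are strictly positive (already smoothed).
    """
    log_scores = []
    for z, prior in enumerate(p_topic):
        score = math.log(prior)
        for w, c in enumerate(counts):
            if c > 0:
                score += c * math.log(p_word_given_topic[z][w])
        log_scores.append(score)
    # Normalize in log space for numerical stability.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [x / total for x in weights]

def enrich(counts, p_word_given_topic, p_topic, strength=1.0):
    """Map a document to topic space, then back to word space.

    Every vocabulary word receives a virtual frequency proportional to
    P(word | document) = sum over topics of P(word | topic) * P(topic | document).
    The scaling by document length and `strength` is an assumption of this sketch.
    """
    posterior = topic_posterior(counts, p_word_given_topic, p_topic)
    length = sum(counts)
    enriched = []
    for w, c in enumerate(counts):
        p_w = sum(posterior[z] * p_word_given_topic[z][w]
                  for z in range(len(p_topic)))
        enriched.append(c + strength * length * p_w)
    return enriched

# Toy vocabulary: ["goal", "match", "stock", "market"]; two hypothetical topics.
p_word_given_topic = [
    [0.45, 0.45, 0.05, 0.05],  # "sports"
    [0.05, 0.05, 0.45, 0.45],  # "finance"
]
p_topic = [0.5, 0.5]

doc = [1, 0, 0, 0]  # the short text contains only "goal"
enriched = enrich(doc, p_word_given_topic, p_topic)
print([round(x, 3) for x in enriched])
```

In this toy example the unseen word "match" receives a larger virtual frequency than "stock" or "market", because the only observed word makes the "sports" topic far more probable. The enriched vectors could then be passed to any basic clustering algorithm.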
Original language: English
Pages (from-to): 2444-2458
Journal: Neurocomputing
Volume: 275
Online published: 16 Nov 2017
Publication status: Published - 31 Jan 2018

Research Keywords

  • Clustering
  • Short text
  • Text enrichment
  • Text mining
