Enhanced Representation for Short Text Clustering from Multiple Perspectives


Student thesis: Doctoral Thesis



Award date: 14 Feb 2018


With the development of social media applications, short text mining is becoming increasingly important. One of the main problems in short text mining is that the succinct form of the text yields sparse feature representations. Because of this sparseness, both feature correlation information (word co-occurrence) and data contiguity information (context information) become less reliable. In this setting, most existing text mining methods, which were devised for regular text data, are less effective on short text mining tasks. To alleviate sparseness in short text clustering, a considerable body of research uses external information such as WordNet or Wikipedia to enrich feature representations, which requires extra work and resources. Furthermore, according to our observation, some enrichment methods assume that the short texts and the external information are sampled from the same knowledge domain, ignoring the fact that the short texts may not be sampled as uniformly as the external information. It is therefore unreasonable to assume that the topical structures of the two domains are identical. In other words, enrichment based on a biased topical structure may lead to inconsistency. In this thesis, we focus on alleviating sparseness with three feature-strengthening approaches: two term-weighting approaches and one corpus-based enrichment approach. These approaches strengthen the original feature representations by weighting discriminative terms or by adding closely related terms as contextual information, making use of semantic groups at three different levels.

Firstly, we observe that discriminative terms are distributed non-uniformly across domains, whereas background words tend to be distributed uniformly. This difference can be measured by a suitably defined functional of a term’s probability distribution over domains, so discriminative terms can be separated from background and noisy words. The separation is performed on semantic groups of domain-level granularity, which are initially estimated from the primary clustering structure of the data.
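The abstract does not specify which functional is used. As a minimal sketch of the idea, the following Python snippet assumes one plausible choice: one minus the normalized entropy of a term's distribution over domains, with per-domain document frequencies normalized by domain size. The function name `term_domain_weights` and the toy data are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict

def term_domain_weights(docs, labels):
    """Assumed functional: 1 - normalized entropy of p(domain | term).

    docs   : list of token lists (one per short text)
    labels : domain (cluster) label of each document
    Returns a dict term -> weight in [0, 1]; values near 0 mean the term
    is spread uniformly over domains (background-like).
    """
    domains = sorted(set(labels))
    k = len(domains)
    domain_size = Counter(labels)
    df = defaultdict(Counter)  # term -> domain -> document frequency
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term][label] += 1

    weights = {}
    for term, per_domain in df.items():
        # Normalize by domain size so that large domains do not dominate.
        rates = [per_domain[d] / domain_size[d] for d in domains]
        total = sum(rates)
        probs = [r / total for r in rates if r > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        weights[term] = 1.0 - entropy / math.log(k) if k > 1 else 0.0
    return weights

# Toy example: "the" occurs in every domain, "goal" in only one.
docs = [["goal", "the"], ["match", "the"], ["stock", "the"], ["bank", "the"]]
labels = [0, 0, 1, 1]
weights = term_domain_weights(docs, labels)
```

In this toy corpus the background word "the" receives a weight close to zero, while "goal" receives the maximum weight. In an iterative scheme, the labels could be re-estimated by clustering the reweighted documents, although the exact update rule is not stated in the abstract.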

Secondly, because sparseness leads to weak connections between short texts, the semantic similarity between them is difficult to measure. We introduce a special term-specific document set, the potential locality set, to capture weak similarity. Specifically, for any two short documents within the same potential locality, the Jaccard similarity between them is greater than 0. In other words, the adjacency graph built on these weak connections within a locality is a complete graph. In the locality-sensitive term weighting scheme, a term associated with a tighter locality is assigned a larger weight. With the proposed weighting scheme, term-level semantics are associated with the densities of certain data areas.
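To make the locality idea concrete, here is a small Python sketch under two assumptions that the abstract does not confirm: the potential locality of a term is taken to be the set of documents containing that term (so any two members share at least that term and have Jaccard similarity above 0), and "tightness" is taken to be the mean pairwise Jaccard similarity within the locality. The helper names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def potential_localities(docs):
    """Assumed definition: term -> indices of the documents containing it."""
    localities = {}
    for i, tokens in enumerate(docs):
        for term in set(tokens):
            localities.setdefault(term, []).append(i)
    return localities

def locality_tightness(docs, members):
    """Assumed tightness: mean pairwise Jaccard similarity in a locality."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0  # a single-document locality carries no pairwise evidence
    return sum(jaccard(docs[i], docs[j]) for i, j in pairs) / len(pairs)

docs = [
    ["cheap", "flight", "deal"],
    ["flight", "deal", "paris"],
    ["the", "flight"],
    ["the", "stock"],
]
localities = potential_localities(docs)
tightness = {t: locality_tightness(docs, m) for t, m in localities.items()}
```

Under these assumptions, the locality of "deal" is tighter than that of "the", so "deal" would receive the larger weight.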

Thirdly, we propose a novel corpus-based enrichment approach that provides richer context for short texts without external knowledge. By introducing a set of conjugate definitions to characterize the structures of topics and words, and by proposing a virtual generative procedure for short texts, we expand the short text data. Specifically, new words that may not appear in a short text document are added with a virtual term frequency, which is obtained from the posterior probabilities of the new words given all the words in that document. The complete procedure can be regarded as mapping data points (documents) from the original feature space to a hidden semantic space (topic space). After semantic smoothing is performed, the data points are mapped back to the original space.
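The abstract does not give the conjugate definitions or the generative procedure, so the following Python sketch only illustrates the general mapping (document to topic space and back). It assumes a simple mixture-of-unigrams model with a known, strictly positive topic-word matrix `phi` and topic prior `prior`; the function `expand_document` and the `scale` parameter are hypothetical.

```python
import math

def expand_document(counts, phi, prior, scale=1.0):
    """Add virtual term frequencies for words absent from a short document.

    counts : list of V observed term counts for one document
    phi    : K x V topic-word probabilities (assumed smoothed, all > 0)
    prior  : K topic prior probabilities
    scale  : assumed knob controlling how much virtual mass is added
    Returns (expanded_counts, topic_posterior).
    """
    # Map to topic space: p(z | d) proportional to p(z) * prod_w phi[z][w]^count
    log_post = []
    for z, p_z in enumerate(prior):
        lp = math.log(p_z)
        for w, c in enumerate(counts):
            if c > 0:
                lp += c * math.log(phi[z][w])
        log_post.append(lp)
    peak = max(log_post)  # subtract the max for numerical stability
    post = [math.exp(lp - peak) for lp in log_post]
    norm = sum(post)
    post = [p / norm for p in post]

    # Map back to word space: p(w | d) = sum_z p(z | d) * phi[z][w]
    length = sum(counts)
    expanded = []
    for w, c in enumerate(counts):
        if c > 0:
            expanded.append(float(c))  # observed counts are kept as-is
        else:
            p_w = sum(post[z] * phi[z][w] for z in range(len(prior)))
            expanded.append(scale * length * p_w)
    return expanded, post

# Vocabulary (illustrative): [flight, hotel, stock, bank]
phi = [[0.45, 0.45, 0.05, 0.05],
       [0.05, 0.05, 0.45, 0.45]]
prior = [0.5, 0.5]
expanded, posterior = expand_document([1, 0, 0, 0], phi, prior)
```

For a document containing only "flight", this sketch assigns a larger virtual frequency to the related word "hotel" than to "stock" or "bank", which is the smoothing effect described above.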

Overall, all three approaches focus on enhancing the representations of short texts. Motivated by the goal of filtering out discriminative terms and further suppressing noise, we design the iterative term weighting scheme by exploiting the different distribution patterns of these terms with respect to the natural clusters of the data. The resulting top-weighted terms give us better insight into the in-domain semantic structures of short texts. They also make us aware that topic information tends to be fragmentary in short texts and that connections between related short texts are generally weak. To alleviate this problem, we propose a corpus-based topic diffusion approach that provides richer contextual information for short texts. In addition, we develop a simple but effective method in which weak similarity information is carefully preserved by potential localities and connections between related short texts are strengthened, which can further be used to facilitate term weighting. We therefore consider that these three types of enhancement approaches highlight structures at different semantic levels. When properly used, all three approaches can be integrated into a classic topic model such as LDA and yield satisfactory results.