Unstructured text mining via topic based semantic analysis

  • Jianfeng SI

Student thesis: Doctoral Thesis

Abstract

The explosive growth of the World Wide Web has brought a tremendous rise in unstructured text data, such as electronic publications, Wikipedia pages, product reviews, blogs and microblogs. These online resources are of great value to real-world applications such as ontology learning, business intelligence, online advertising and market prediction, which creates a strong demand for turning unstructured text into knowledge efficiently. This thesis focuses on topic based unstructured text mining, ranging from well edited text to discourse level text. Compared with traditional relational database applications, there is a large semantic gap between raw text and human understanding, and manual summarization is infeasible, which makes this gap a bottleneck for further applications. Most web data are in an unstructured text format, i.e., they have no semantic structure. To fill the gap, we exploit topic modeling for knowledge mining from unstructured text. A topic here is a latent semantic pattern in the text; topics reduce the original high-dimensional bag-of-words space to a lower-dimensional, meaningful topic space. Discovering the topics in unstructured text helps digest massive amounts of user generated content and produce meaningful insights. We mainly study two kinds of unstructured text data, namely knowledge based text and discourse level text, as follows:
1. Knowledge based text
We regard well edited text such as electronic publications and Wikipedia web pages as knowledge based text. Such text embodies a lot of human effort and can be regarded as a very good learning resource or reference. Although most of these resources can be retrieved through major search engines, which focus on keyword based information retrieval, a hierarchical knowledge based semantic structure is needed to provide people with global topic summarization and navigation.
In this thesis, we learn a semantic topic tree from knowledge based text, which can serve both as a high level summarization of text collections and as a knowledge navigation tool. An ideal semantic representation of a text corpus should exhibit a hierarchical topic tree structure, and topics residing at different levels of the tree should exhibit different levels of semantic abstraction (i.e., the deeper the level at which a topic resides, the more specific it is). Instead of learning every node directly, which is quite time consuming, our approach builds on a nonparametric Bayesian topic model, namely Hierarchical Dirichlet Processes (HDP). By tuning the topics' Dirichlet concentration parameter, two topic sets with different levels of abstraction are learned separately from the HDP and then integrated through a hierarchical clustering process. We term our approach HDP Clustering (HDP-C).
2. Discourse level text
Web 2.0 technology provides an interactive interface between people and the web, and among people through the web, making the web active with rich information. The web, and social media platforms in particular, has become a ubiquitous platform for social networking and content sharing. People's life experiences, opinions and beliefs are expressed or reflected in this user generated content, which gives researchers an unprecedented opportunity to study various applications. In this thesis, we are interested in two main kinds of social media data, as follows:
(a) Product review
A large volume of product review data can reveal consumers' major interests, and it has attracted great research attention from the academic community. Most existing work focuses on review summarization, aspect identification or opinion mining from an item's point of view, such as the quality or popularity of products.
Considering that the authors of review texts pay different amounts of attention to product aspects according to their own interests, we aim to learn K user interest groups as indicated by their review writings. Identifying these K interest groups facilitates a better understanding of the concerns of major and potential consumers, which is crucial for applications such as customer-oriented product improvement or diverse marketing strategies. Instead of using a traditional text clustering approach, we treat the groupId/clusterId as a hidden variable and introduce a permutation-based structural topic model called KMM. Through this model, we infer the distribution of the K interest groups by discovering not only the frequency of product aspects (Topic Frequency), but also the order in which the respective aspects occur (Topic Order). Together they present an informative summarization of the raw review corpus.
(b) Microblogging data
People's daily life experiences and opinions expressed or reflected in user generated content can be further aggregated into social signals, which are related to socio-economic phenomena. These messages and the public sentiment contained in them have been studied for a wide range of applications, such as predicting polls, senate elections and various other socio-economic phenomena. In this thesis, we propose a technique that leverages topic based sentiments from Twitter to help predict the stock market. We first use a continuous Dirichlet Process Mixture model to learn the daily topic set. Then, for each topic, we derive its sentiment from its distribution of opinion words to build a sentiment time series. We then regress the stock index and the Twitter sentiment time series to predict the market. Experiments on the real-life S&P100 Index show that our approach is effective and performs better than existing state-of-the-art non-topic based methods.
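As a rough illustration of the last step, the following sketch regresses a daily index change on the previous day's sentiment score using ordinary least squares. It is a deliberate simplification under stated assumptions: the abstract does not specify the exact regression model, so the single lagged predictor, the function name `fit_lagged_regression` and the toy numbers are all hypothetical.

```python
def fit_lagged_regression(sentiment, index_change, lag=1):
    """Fit index_change[t] = intercept + slope * sentiment[t - lag] by OLS.

    Hypothetical helper for illustration only; not the thesis's model.
    """
    if lag < 1 or len(sentiment) != len(index_change):
        raise ValueError("lag must be >= 1 and the series must have equal length")
    x = sentiment[:-lag]      # sentiment observed `lag` days earlier
    y = index_change[lag:]    # index change on the day being predicted
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data (made up): each index change equals 1 + 2 * previous-day sentiment.
sentiment = [0.1, -0.2, 0.3, 0.0, 0.5]
index_change = [0.0, 1.2, 0.6, 1.6, 1.0]

intercept, slope = fit_lagged_regression(sentiment, index_change, lag=1)
# One-step-ahead forecast from the most recent sentiment value.
next_day_forecast = intercept + slope * sentiment[-1]
```

A fuller model would presumably also include lagged values of the index itself and be evaluated on held-out days; this sketch only shows how a sentiment series can enter a regression as a lagged predictor.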
Additionally, we discover interesting entity co-occurrences in Twitter posts (e.g., stocks or companies represented by their stock ticker symbols), which reveal the mutual influence among the stocks of the participating entities. We exploit the use of stock ticker symbols as cash-tags (i.e., "$" followed by a ticker symbol) in Twitter posts, and their co-occurrences, to build a semantic stock network (SSN) in which two stocks are connected when they frequently co-occur in tweets. Semantic topics are attached to both the nodes and the edges of the network to enrich its expressiveness. This semantic stock network summarizes the main topics of interest discussed in social media. Motivated by the pairwise relationship between neighboring stocks in the SSN, we compute a lexicon-based sentiment score for each node and edge from its social semantic context, and then apply a vector autoregression (VAR) model to predict the price movement of a specific stock. We find that sentiment from a stock's closest neighbors in the network can improve prediction accuracy markedly, and performs even better than the sentiment of the target stock itself.
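The network construction described above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the cash-tag pattern (one to five letters after "$"), the `min_count` threshold standing in for "frequently", the function name `build_cooccurrence_edges` and the sample tweets are all assumptions.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed cash-tag shape: "$" followed by 1-5 letters (e.g. $AAPL).
CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")


def build_cooccurrence_edges(tweets, min_count=2):
    """Return {(ticker_a, ticker_b): count} for pairs co-occurring often enough."""
    pair_counts = Counter()
    for text in tweets:
        # Deduplicate and normalize case so each pair counts once per tweet.
        tickers = sorted({m.upper() for m in CASHTAG.findall(text)})
        for a, b in combinations(tickers, 2):
            pair_counts[(a, b)] += 1
    # Keep only edges that meet the (assumed) frequency threshold.
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Made-up sample tweets for illustration.
tweets = [
    "$AAPL and $GOOG both up today",
    "Watching $aapl vs $GOOG earnings",
    "$MSFT alone",
    "$AAPL $MSFT",
]
edges = build_cooccurrence_edges(tweets, min_count=2)
```

With these sample tweets only the AAPL/GOOG pair reaches the threshold, so `edges` contains a single edge; topic and sentiment annotations on nodes and edges would be layered on top of such a structure.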
Date of Award: 2 Oct 2013
Original language: English
Awarding Institution
  • City University of Hong Kong
Supervisor: Qing LI (Supervisor) & Xiaotie DENG (Co-supervisor)

Keywords

  • Text processing (Computer science)
  • Data mining
  • Semantic computing
