Message passing for pattern recognition with bag-of-word representations
Student thesis: Doctoral Thesis
Award date: 15 Jul 2014
This dissertation studies novel message passing-based topic modeling techniques for pattern recognition with bag-of-word (BOW) representations. Recently, probabilistic topic models such as latent Dirichlet allocation (LDA) have found many important applications in pattern recognition and machine learning. They represent documents (or images and videos) by BOW features: each document is composed of discrete (visual) words without ordering information. As a result, a collection of documents can be represented by a sparse BOW matrix, where the rows denote the vocabulary word indices, the columns denote the document indices, and the nonzero elements denote the word counts. Topic modeling can be viewed as a labeling problem, in which a set of thematic topic labels is assigned to explain the observed nonzero elements in the BOW matrix. LDA evaluates topic labeling configurations by the joint probability of observed words and latent topic labels. By maximizing this joint probability using approximate inference methods, we obtain the best topic labeling configuration for all nonzero elements in the BOW matrix, which partitions the vocabulary words into several soft thematic groups called topics. From a collection of documents, LDA learns and outputs document-topic and word-topic distributions. The former is the document-specific topic proportion, and the latter is the topic-specific word proportion. Document-topic distributions are often used as document clustering results or as lower-dimensional topic-based features for classification, while word-topic distributions are often used to illustrate the "hot" words that convey the semantic meaning of each topic. Both distributions are useful for many pattern recognition problems. In this dissertation, we develop novel message passing-based inference techniques to solve two problems of LDA: 1) scalability for big data, and 2) higher-order uncertainty in topic distributions.
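To make the labeling view concrete, the following is a minimal sketch (assuming Python with NumPy; not the thesis's own code) of collapsed Gibbs sampling, one standard approximate inference method for LDA, producing the document-topic and word-topic distributions described above. The toy corpus, hyperparameters `alpha` and `beta`, and topic count `K` are illustrative choices.

```python
import numpy as np

# Minimal collapsed Gibbs sampler for LDA (a standard inference method,
# sketched for illustration; the thesis develops faster message passing
# alternatives). Documents are lists of word indices into a vocabulary of V.
rng = np.random.default_rng(0)
docs = [[0, 1, 1, 2], [0, 0, 2, 3], [4, 5, 5, 6], [4, 6, 6, 7]]
V, K, alpha, beta = 8, 2, 0.1, 0.01

# Count matrices: document-topic, topic-word, and per-topic totals.
ndk = np.zeros((len(docs), K))
nkw = np.zeros((K, V))
nk = np.zeros(K)
z = []                                  # topic label for every word token

for d, doc in enumerate(docs):          # random initial labeling
    zd = []
    for w in doc:
        k = int(rng.integers(K))
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        zd.append(k)
    z.append(zd)

for _ in range(200):                    # Gibbs sweeps over nonzero elements
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]                 # remove the current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # conditional p(z=k | rest) ∝ (ndk+alpha) * (nkw+beta)/(nk+V*beta)
            p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
            z[d][i] = k

# Normalized document-topic and word-topic distributions.
doc_topic = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
word_topic = (nkw + beta) / (nkw + beta).sum(axis=1, keepdims=True)
print(np.round(doc_topic, 2))
```

Here `doc_topic` plays the role of the document-specific topic proportions (usable as low-dimensional features), and `word_topic` the topic-specific word proportions from which the "hot" words of each topic are read off.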
On the one hand, LDA's inference algorithms have high time and space complexities that scale linearly with the number of nonzero elements and the number of topics. When the number of documents and the number of topics are very large, it is almost impossible to learn LDA on a consumer-level PC. For example, on the publicly available PUBMED data set (8,200,000 documents, 141,043 vocabulary words, and 483,450,157 nonzero elements), the state-of-the-art inference algorithms for LDA require around two months to extract 10,000 topics. To speed up inference for LDA, we propose active message passing and online message passing techniques, which need to scan only a subset of nonzero elements and a subset of topics, selected by dynamic scheduling based on message residuals. Empirical studies confirm the effectiveness of our acceleration strategies on big data. For example, our online message passing algorithm can extract 10,000 topics from PUBMED using a 2GB buffer on a common desktop computer with 4GB of memory in around one day (29 hours), making it the fastest LDA algorithm reported to date. On the other hand, topics may mean different things to different people. To account for the higher-order uncertainty of topic distributions, we integrate LDA with type-2 fuzzy sets (T2 FS) to form type-2 fuzzy topic models, in which the primary membership grade of a T2 FS describes the word uncertainty in topics, and the secondary membership grade describes the higher-order topic uncertainty. We develop type-2 fuzzy message passing algorithms to learn type-2 fuzzy topic models, and confirm their effectiveness in human action recognition applications. To summarize, we develop variants of message passing inference techniques within the probabilistic topic modeling framework. These techniques are effective for handling big, uncertain data in many real-world pattern recognition problems with BOW representations.
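The residual-based dynamic scheduling idea can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not the thesis's active message passing for LDA: it applies greedy largest-residual updates to a generic contractive fixed-point system, skipping components whose residuals are already small, which is the mechanism that lets such schemes scan only a subset of messages.

```python
import numpy as np

# Toy sketch of residual-based dynamic scheduling, shown on the generic
# fixed-point system x = 0.5*A@x + b rather than LDA's message updates.
# At each step only the component with the largest residual is updated.
rng = np.random.default_rng(0)
n = 50
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)       # row-stochastic, so the map contracts

b = rng.random(n)

def f(x):
    return 0.5 * (A @ x) + b            # fixed-point map with a unique solution

x = np.zeros(n)
updates = 0
for _ in range(100_000):
    residual = np.abs(f(x) - x)
    i = int(np.argmax(residual))        # schedule the largest-residual update
    if residual[i] < 1e-10:
        break                           # every component has (nearly) converged
    x[i] = 0.5 * (A[i] @ x) + b[i]      # update only that single component
    updates += 1

exact = np.linalg.solve(np.eye(n) - 0.5 * A, b)
print(updates, np.abs(x - exact).max())
```

A real implementation would keep residuals in a priority queue rather than recomputing the full residual vector each step; the O(n²)-per-update recomputation here just keeps the sketch short.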
In the future, these techniques can be readily generalized to solve unsupervised feature or representation learning problems within the deep learning framework.
- Pattern recognition systems, Machine learning