A New Hierarchical EM Algorithm for Reducing Mixture Models

Project: Research

View graph of relations


With the advent of social networks and cloud computing, the analysis of constantly growing, large-scale data collections (“Big Data”) has become increasingly important. Standard machine learning tools, such as probabilistic generative models and the associated expectation maximization (EM) learning algorithm, have seen numerous successes in modeling and analyzing real-world data. However, EM faces computational bottlenecks (for scalability and efficiency) that limits the application of generative models to large-scale datasets, even though such models are well suited to deal with confounding factors, such as ambiguous data and noisy labels, typically present in such datasets.The hierarchical EM (HEM) algorithm offers a principled solution to such computational bottlenecks, by splitting the learning problem into two stages: 1) partition the data into smaller portions (e.g., a single image) and estimate an intermediate model on each portion using EM; 2) summarize the intermediate models into a final model with HEM. HEM is an extension of EM that takes abasemixture model (e.g., a Gaussian mixture model), and clusters the base mixture components (Gaussians) into groups, while learning a novel cluster center (another Gaussian) to represent each group. These cluster centers are the mixture components of areducedmixture model with fewer components. This is analogous to K-means clustering for real-vectors, except with HEM the data points are probability distributions. Current applications of HEM to large datasets include efficient hierarchical density estimation for image and music annotation, and building an indexing tree for fast retrieval.While HEM has proven successful for these applications, there are still limitations that need to be addressed before it can be applied elsewhere. First, the original formulation of HEM is a type ofmultiple-instance learning(MIL), which promotes commonly occurring mixture components, while suppressing outliers, thus distorting the reduced mixture model from the base density. This property of HEM is useful for estimating annotation models from weakly supervised data, where the goal to find the commonly occurring features present in the noisy training set. However MIL-based density estimation is not suitable when the goal is to estimate a reduced mixture model that best represents the density of the base mixture model. Second, applying HEM to a particular generative model requires deriving customized inference procedures, which is tedious and complicated. The goal of this project is to develop a new “density-preserving” hierarchical EM algorithm for hierarchical estimation. In particular, the project aims to develop an HEM algorithm that estimates a reduced mixture model that well represents the base mixture density, and a corresponding general HEM inference procedure for probabilistic graphical models. The result is a new HEM framework that can be easily applied to a diverse set of machine learning problems, including large-scale density estimation, Bayesian filtering, belief propagation, and fast nearest-neighbor search for densities.


Project number9041870
Grant typeGRF
Effective start/end date1/01/147/06/18