Statistical learning techniques for rare event detection
Student thesis: Doctoral Thesis
Related Research Unit(s)
The increasing number and variety of sensors deployed across many applications presents us with exceptionally large data. These data streams contain information about various events. Certain events, though rare, translate to significant information. Detecting rare events is a critical task in a wide variety of application domains, including the detection of product defects, network intrusions and medical incidents. However, many challenges to rare event detection are encountered in practice. The challenges studied in this thesis have been abstracted from several critical problems reported in the field of in rare event detection, and the techniques for detecting rare events are examined in the framework of statistical learning. In statistical learning, rare event detection can be formulated as a supervised learning or unsupervised learning task, depending on whether the historical (labeled) data is available or not. In this thesis, we investigate the techniques of rare event detection from both perspectives, and propose several new approaches to detecting rare events in a timely and accurate manner. From the perspective of unsupervised learning, rare events are regarded as outliers that do not follow a well-defined notion of common behavior. Finding outliers by observing their distances to their k-th nearest neighbor is a popular non-parametric approach in outlier detection. Although many distance-based outlier detection algorithms have been proposed in literature, the computational efficiency of most of these worsens significantly as the size or dimensionality of the dataset increases. Large size and high dimensionality are now common characteristics of data generated in many real applications. Thus, computational efficiency and, in particular, scalability of computational performance to large, high dimensional datasets is a key factor in determining an algorithm's practical utility. In this thesis, we propose a novel distance-based algorithm for outlier detection that exhibits near-linear computation time with increasing dataset size and dimensionality. The algorithm involves a two-step procedure. The first is a preprocessing stage aimed at finding suspicious outliers, which helps speed up the second step that actually finds the true outliers. We provide theoretical justification for the near-linear time performance of the algorithm. We also present empirical comparisons with existing state-of-the-art outlier detection algorithms. From the perspective of supervised learning, rare event detection can be regarded as a problem of imbalanced classification. The training data is assumed to have signicantly fewer instances of rare events, called the minority class, compared to the majority class. Approaches to the imbalanced classification problem usually focus on rebalancing the class sizes by neglecting the effect of hidden structure within the majority class. We highlight the effect of sub-clusters within the majority class on detecting the minority instances and handle the imbalanced classification problem by learning the structure implicit in the data. We propose a decomposition-based approach involving classification by hidden structure learning (CHSL) to solve the two-class imbalanced classification problem. The approach first learns the hidden structure of the majority class using an unsupervised learning algorithm and then transforms the classification problem into several classification sub-problems. Next, the base classifier is constructed for each sub-problem. The ensemble is tuned to increase its sensitivity towards the minority class. We also provide a metric for estimating the stability of the decomposition, which appears necessary for good classifier performance. We demonstrate the performance of the proposed approach by applying it to a range of real datasets. As an extension, we present a specific form of CHSL embedding logistic regression and discuss its application in infectious disease prognosis. We demonstrate the performance of the approach using a real dataset on the prognosis of Methicillin-resistant Staphylococcus aureus (MRSA) acquisition. It is empirically shown that the prognosis ability of logistic regression on MRSA infection can be improved by applying a decomposition strategy coupled with tuning decision boundaries. Different sets of significant risk factors are detected by the proposed approach, which should be of assistance in infection control planning in practice. Finally, we propose a model-based clustering method for learning the hidden structure within the minority/majority class in an imbalanced classification problem, especially in a high dimensional setting. The proposed method is motivated by the regularized Gaussian mixture model (GMM). In a conventional GMM, the number of parameters increases rapidly and the maximum-likelihood estimation (MLE) problem becomes ill-posed as dimensionality increases. To overcome this difficulty, we propose a regularization method for parameter estimation in learning GMM. The regularization method guarantees a positive definite estimate of the inverse covariance matrix. More importantly, it finds a low-dimensional representation of the covariance matrix that enables the estimation of correlations. The optimal solution for the regularization problem can be arrived at efficiently by transforming the optimization into a determinant-maximization problem. The regularization method can be incorporated in the Expectation Maximization (EM) algorithm for maximizing the likelihood function of the GMM. The performance of the proposed method is demonstrated by analyzing several simulated datasets. We also provide a case study to illustrate the potential value of the proposed method in real application.
- Machine learning, Statistical methods, Data mining