Robust clustering methods for pattern classification

  • Shu Yan LAM

Student thesis: Doctoral Thesis

Abstract

Clustering has been widely used in many applications including data mining, pattern recognition and machine learning. Noise is a major problem in cluster analysis, which degrades the performance of many existing methods. This thesis is aimed at solving noise problems in data clustering and pattern detection. In Chapters 2 to 5, we discuss the data clustering problem, when both outliers and spurious repeated patterns are present in the dataset. These two noise corruptions cause problem controlling the support, which is the non-zero region of the function used for updating cluster centers in the optimization or estimation procedure. A local support based fuzzy c-means algorithm is developed in this thesis. This method adopts a progressive support reduction approach to locate the optimal solution. It consists of two stages. Initially, it uses a large support to locate the solution. The support of the necessary condition in this stage is robust to the presence of outliers and is able to escape from local trap state optima. Then, in the second stage, we use the solution from stage one as an initial guess and a smaller support avoid the noise effect yielded by the high dimensionality. Experiments show that this technique is able to resolve both of these problems and it outperforms 10 existing algorithms. Also, it shows more than 10% to 40% improvement in the classification error rates compared with many existing methods in real world datasets. In Chapters 4 to 6, data clustering problems, with different types of noise corruption in the feature space, are considered. There are two types of noise corruptions in high dimensional space: repeated patterns and low-density. These two problems reduce the reliability of the distance measures between two samples. Many researchers resolve these problems by pruning the corrupted dimensions. However, the use of dimensionality reduction may increase the classification error rate. This becomes a dilemma when pruning the dimension of a dataset. In this thesis, new techniques are developed to handle the problem without pruning the dimensions. In the problem of repeated patterns, locally supported algorithms are developed, which confine the support in the necessary condition so that the samples, which are far away from the cluster centers, make a small or even no contribution. This greatly reduces the effect of repeated patterns. In the low-density problem, a sub-dimension approach is developed, which partitions the data space into several overlapping subspaces. Then, the distance measures between two samples are made based on each of these sub-dimensions. This approach reduces the effect produced by the low-density problem. Experimental results show that these methods handle the high dimensionality problems effectively. In Chapters 7 and 8, noise in the pattern detection problem is studied and two different sample corruption problems are considered. In the first situation, the data object can be described by a parametric curve and the dataset itself is heavily corrupted by noise. A level set based affine transform method is developed to drive the user-defined curve towards the data object and avoid the influence produced by the corrupted samples. The second situation is that the data object is of arbitrary shape and its boundary is corrupted by noise. A generalized curve is developed to handle this complicated problem, which finds a curve passing through the middle of the dataset. Experiments show that these methods are robust to noise and are able to yield very good results.
Date of Award15 Jul 2008
Original languageEnglish
Awarding Institution
  • City University of Hong Kong
SupervisorHong YAN (Supervisor)

Keywords

  • Cluster analysis
  • Pattern recognition systems
  • Statistical decision

Cite this

'