Class discovery is an important task in pattern recognition whose goal is to classify
the data points into a number of classes. Given a data set P, the objectives of
class discovery are (i) to classify the data points into the corresponding clusters,
which maximizes the similarity between the points in one cluster, and maximizes the
dissimilarity between the points in the different clusters, (ii) to identify the number
of clusters. In order to achieve these goals, the clustering algorithms and the cluster
validity indices are necessary. As a result, both the clustering algorithms and cluster
validity indices are addressed in this thesis.
In practice, there exist two categories of datasets: one category of datasets is fea-
tured by a large number of low dimensional data points, such as pixels in a images,
while the other category of datasets is characterized by a small number of high dimen-
sional data points, such as the samples in microarray. The first category of datasets
pays more attention to the efficiency of the class discovery approach, while the second
category of datasets focuses on the accuracy of the class discovery approach. In order
to perform class discovery in these categories of datasets, we propose a fast cluster
algorithm combining with the traditional cluster validity index to discovery the un-
derlying structure of the first category of datasets as efficient as possible, and design
a general framework which integrates the data perturbation technique, the cluster
ensemble approach and the new cluster validity index to discover the classes from the
second category of datasets as accurate as possible.
First, we present a hashing based clustering algorithm (HBCA) which is a variant
of the K-means algorithm to classify a large number of low dimensional data points
in the first category of datasets efficiently. Unlike previous clustering algorithms,
HBCA places more emphasis on the running time of the algorithm. Specifically,
HBCA first hashes the data points into a set of histogram bins by a hashing function.
Then, it determines the initial centers of the clusters (regions) according to this point
distribution. Finally, HBCA performs clustering at the histogram bin level, rather
than the data point level. Two approaches are proposed to improve the performance
of HBCA further: (i) a shrinking process is performed on the histogram bins to reduce
the number of distance computations, and (ii) a hierarchical structure is constructed
to perform efficient indexing on the histogram bins. Finally, the performance of
HBCA is analyzed theoretically and experimentally. The experimental results show
that HBCA (i) can be easily implemented, (ii) identifies the clusters effectively and
(iii) outperforms most of the current state-of-the-art clustering approaches in terms
of efficiency when applied to image segmentation.
HBCA is also applied to solve the problem of the detection of visual attention re-
gions (VAR). We propose a rule based technique for the extraction of visual attention
regions at the object level based on HBCA, such that VAR detection can be per-
formed in a very efficient way. The proposed technique consists of four stages: (i) a
fast segmentation technique based on HBCA; (ii) a refined specification of VAR which
is known as the hierarchical visual attention regions (HVAR); (iii) a new algorithm
known as the rule based detection algorithm (RADA) to obtain the set of HVARs
in real time, and (iv) a new adaptive image display module and the corresponding
adaptation operations using HVAR. We also define a new background measure which
combines both feature contrast and the geometric property of the region to identify
the background region, and a confidence factor which is used to extract the set of
hierarchical visual attention regions. Compared with existing techniques, RADA has
two advantages: (i) The approach detects the visual attention region at the object
level, which bridges the gap between traditional visual attention regions and high-level
semantics; (ii) Our approach is efficient and easy to implement.
Second, we design a general framework for class discovery from the second category
of datasets which is featured by a small number of a very high dimensional data points.
The new framework integrates the data perturbation technique, the cluster ensemble
approach and the cluster validity index. Based on the new framework, we propose
two cluster validation approaches. The first approach integrates the random subspace
technique, the cluster ensemble approach based on the normalized cut and the Mod-
ified Rand Index, while the second approach integrates the noise injection technique,
the cluster ensemble based on Neural Gas and the Disagreement/Agreement (DA)
index. The experiments in the synthetic datasets and real datasets show that (i)
Both the Modified Rand Index and the Disagreement/Agreement index successfully
discovers the underlying structure from most of the synthetic datasets and the real
datasets, and (ii) both new indices outperform most of the state-of-the-art cluster
validity indices when applied to gene expression data.
| Date of Award | 15 Feb 2008 |
|---|
| Original language | English |
|---|
| Awarding Institution | - City University of Hong Kong
|
|---|
| Supervisor | Hau San WONG (Supervisor) |
|---|
- Algorithms
- Pattern recognition systems
Class discovery and its application
YU, Z. (Author). 15 Feb 2008
Student thesis: Doctoral Thesis