TY - JOUR
T1 - Determining the number of clusters using information entropy for mixed data
AU - Liang, Jiye
AU - Zhao, Xingwang
AU - Li, Deyu
AU - Cao, Fuyuan
AU - Dang, Chuangyin
PY - 2012/6
Y1 - 2012/6
AB - In cluster analysis, one of the most challenging and difficult problems is the determination of the number of clusters in a data set, which is a basic input parameter for most clustering algorithms. To solve this problem, many algorithms have been proposed for either numerical or categorical data sets. However, these algorithms are not very effective for a mixed data set containing both numerical attributes and categorical attributes. To overcome this deficiency, a generalized mechanism is presented in this paper by integrating Rényi entropy and complement entropy together. The mechanism is able to uniformly characterize within-cluster entropy and between-cluster entropy and to identify the worst cluster in a mixed data set. In order to evaluate the clustering results for mixed data, an effective cluster validity index is also defined in this paper. Furthermore, by introducing a new dissimilarity measure into the k-prototypes algorithm, we develop an algorithm to determine the number of clusters in a mixed data set. The performance of the algorithm has been studied on several synthetic and real world data sets. The comparisons with other clustering algorithms show that the proposed algorithm is more effective in detecting the optimal number of clusters and generates better clustering results. © 2011 Elsevier Ltd. All rights reserved.
KW - Cluster validity index
KW - Clustering
KW - Information entropy
KW - k-Prototypes algorithm
KW - Mixed data
KW - Number of clusters
UR - http://www.scopus.com/inward/record.url?scp=84857042237&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-84857042237&origin=recordpage
U2 - 10.1016/j.patcog.2011.12.017
DO - 10.1016/j.patcog.2011.12.017
M3 - RGC 21 - Publication in refereed journal
SN - 0031-3203
VL - 45
SP - 2251
EP - 2265
JO - Pattern Recognition
JF - Pattern Recognition
IS - 6
ER -