Clustering ensemble selection for categorical data based on internal validity indices

Xingwang Zhao, Jiye Liang*, Chuangyin Dang

*Corresponding author for this work

    Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

    Abstract

    Clustering ensemble selection is an effective technique for improving the quality of clustering results. However, traditional methods usually measure quality and diversity from the cluster labels of the base clusterings alone, ignoring the information in the original data. To solve this problem, a new clustering ensemble selection algorithm for categorical data is presented. In this algorithm, five popular internal validity indices are used to measure the quality of the base clusterings, and the normalized mutual information is used to measure their diversity. The partition with the highest quality is selected first to participate in the ensemble. Then, the base partitions with the highest clustering quality and the highest diversity with respect to the partitions selected in previous iterations are selected iteratively, until the required number of base clusterings is reached. The effectiveness and robustness of the proposed algorithm are evaluated in comparison with the full ensemble, random selection ensembles, and state-of-the-art ensemble selection algorithms. Experimental results on real categorical data sets show that the proposed algorithm is competitive with existing ensemble selection algorithms in terms of clustering quality.
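
The greedy selection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the five internal validity indices are computed or how quality and diversity are combined, so the sketch assumes each base partition already has a single quality score normalized to [0, 1], defines diversity as one minus the mean NMI to the partitions already selected, and combines the two with a hypothetical weight `alpha`. The names `nmi`, `select_ensemble`, `quality`, and `alpha` are illustrative only.

```python
from collections import Counter
from math import log


def nmi(a, b):
    """Normalized mutual information of two label lists (geometric-mean normalization)."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    # Mutual information from the joint and marginal label counts.
    mi = sum(c / n * log(c * n / (ca[x] * cb[y])) for (x, y), c in cab.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    if ha == 0 or hb == 0:
        # Degenerate case: at least one partition has a single cluster.
        return 1.0 if ha == hb else 0.0
    return mi / (ha * hb) ** 0.5


def select_ensemble(partitions, quality, size, alpha=0.5):
    """Greedily pick `size` partition indices, trading off quality and diversity.

    ASSUMPTION: `quality` holds one precomputed score per partition in [0, 1],
    and the weighted sum below is a stand-in for the paper's actual criterion.
    """
    # Step 1: start with the partition of highest quality.
    selected = [max(range(len(partitions)), key=lambda i: quality[i])]
    # Step 2: repeatedly add the best remaining candidate.
    while len(selected) < min(size, len(partitions)):
        def score(i):
            mean_nmi = sum(nmi(partitions[i], partitions[j]) for j in selected) / len(selected)
            return alpha * quality[i] + (1 - alpha) * (1 - mean_nmi)

        candidates = [i for i in range(len(partitions)) if i not in selected]
        selected.append(max(candidates, key=score))
    return selected


if __name__ == "__main__":
    base = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
    print(select_ensemble(base, quality=[0.9, 0.8, 0.6], size=2))
```

In the toy input, the second partition is a relabeling of the first (NMI of 1), so despite its higher quality score it adds no diversity, and the third partition is chosen instead.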
    Original language: English
    Pages (from-to): 150-168
    Journal: Pattern Recognition
    Volume: 69
    Online published: 18 Apr 2017
    Publication status: Published - Sept 2017

    Research Keywords

    • Categorical data
    • Clustering ensemble selection
    • Clustering validity indices
    • Diversity
    • Quality
