
Clustering Heterogeneous Data with k-Means by Mutual Information-Based Unsupervised Feature Transformation

  • Min Wei*
  • Tommy W. S. Chow
  • Rosa H. M. Chan
  • *Corresponding author for this work

Research output: Journal Publications and Reviews, RGC 21 - Publication in refereed journal, peer-review


Abstract

Traditional centroid-based clustering algorithms cluster heterogeneous data with numerical and non-numerical features inaccurately, to varying degrees. This is because the Hamming distance used to measure dissimilarity between non-numerical values does not provide optimal distances between different values, and because problems arise from attempts to combine the Euclidean distance with the Hamming distance. In this study, mutual information (MI)-based unsupervised feature transformation (UFT), which can transform non-numerical features into numerical features without information loss, was combined with the conventional k-means algorithm for heterogeneous data clustering. For the original non-numerical features, UFT provides numerical values that preserve the structure of those features and, at the same time, have the properties of continuous values. Experiments and analysis on real-world datasets showed that the integrated UFT-k-means clustering algorithm outperformed other algorithms on heterogeneous data with both numerical and non-numerical features.
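The two difficulties named in the abstract can be illustrated with a minimal sketch. It does not implement the paper's UFT method; the records, the helper names, and the weight `gamma` are hypothetical, and the additive mix of Euclidean and Hamming distance is only one common way such a combination is attempted.

```python
from math import sqrt

def hamming(a, b):
    """Count the positions at which two tuples of categorical values differ."""
    return sum(x != y for x, y in zip(a, b))

def euclidean(a, b):
    """Ordinary Euclidean distance between two numeric tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mixed_distance(p, q, gamma=1.0):
    """Hypothetical mixed dissimilarity: Euclidean on the numeric part
    plus gamma times Hamming on the categorical part (gamma is an assumption)."""
    return euclidean(p[0], q[0]) + gamma * hamming(p[1], q[1])

# Each record is (numeric features, categorical features); values are made up.
a = ((1.0, 2.0), ("red", "small"))
b = ((1.0, 2.0), ("orange", "small"))
c = ((1.0, 2.0), ("blue", "small"))
d = ((4.0, 6.0), ("red", "small"))

# Problem 1: Hamming distance cannot tell that "orange" might be closer
# to "red" than "blue" is; every mismatch costs exactly 1.
same_cost = hamming(a[1], b[1]) == hamming(a[1], c[1])

# Problem 2: the mix depends on the numeric scale. In the original units
# d is farther from a than c is; after multiplying the numeric features
# by 0.1 the ordering flips, although the data are the same.
def rescale(record, factor):
    return (tuple(factor * x for x in record[0]), record[1])

before = mixed_distance(a, d) > mixed_distance(a, c)
after = mixed_distance(rescale(a, 0.1), rescale(d, 0.1)) > mixed_distance(
    rescale(a, 0.1), rescale(c, 0.1)
)
print(same_cost, before, after)
```

Transforming the non-numerical features into numerical values, as the abstract describes, would let a single distance apply to all features, so no weight like `gamma` would be needed.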
Original language: English
Pages (from-to): 1535 - 1548
Journal: Entropy
Volume: 17
Issue number: 3
Publication status: Published - 9 Mar 2015

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
