Heterogeneous Data Analysis Based on Non-numerical Feature Transformation
基於非數值型特徵轉換的混合數據分析
Student thesis: Doctoral Thesis
Detail(s)
Award date | 25 May 2016 |
---|---|
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(a2593c67-65e5-4ef3-9066-8db870a72d76).html |
---|---|
Abstract
Heterogeneous data, which contain both numerical and non-numerical features, are commonly encountered. Existing machine learning methods, including feature subset selection, clustering and classification, are limited in analyzing heterogeneous data because numerical features carry both scale and probability information, whereas non-numerical features carry only probability information. This difference in data format can degrade the performance of widely used machine learning algorithms, which were developed for purely numerical data. We therefore propose novel feature transformation methods that transform non-numerical features into numerical ones, unifying the data format for feature subset selection, clustering and classification of heterogeneous data.
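The distinction between scale and probability information can be illustrated with a toy example. The data values below are invented for illustration only:

```python
from collections import Counter

# Toy heterogeneous dataset: one numerical and one non-numerical
# (categorical) feature per sample.
ages = [23, 31, 45, 31, 52]                      # numerical: carries scale information
colors = ["red", "blue", "red", "red", "green"]  # non-numerical: only frequencies

# A numerical feature supports scale-based statistics such as the mean ...
mean_age = sum(ages) / len(ages)

# ... while a categorical feature only yields probability estimates.
counts = Counter(colors)
probs = {v: c / len(colors) for v, c in counts.items()}

print(mean_age)      # 36.4
print(probs["red"])  # 0.6
```

Any algorithm built around distances or averages can use `ages` directly, but has no principled way to consume `colors` without first transforming it.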
Conventional mutual information (MI)-based feature selection (FS) methods cannot handle heterogeneous data properly because of differences in data format or in the methods used to estimate the MI between a feature subset and the class label. We therefore developed an MI-based unsupervised feature transformation (UFT) that transforms non-numerical features into numerical ones. The UFT process is independent of the class label and is therefore suitable for feature subset selection. MI-based FS algorithms, such as the Parzen window feature selector (PWFS), minimum redundancy maximum relevance feature selection (mRMR), and normalized MI feature selection (NMIFS), can all adopt UFT as a pre-processing step for non-numerical features. Moreover, the proposed UFT is unbiased, which allows PWFS to be used to full advantage. Simulations and analysis of synthetic and benchmark datasets showed that the feature subsets selected by the integrated method, UFT-PWFS, outperformed those of other integrated feature transformation and selection (FT-FS) methods in classification accuracy by around 10%.
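The MI between a discrete feature and the class label, which the FS algorithms above rank features by, can be estimated directly from empirical counts. The sketch below is a minimal plug-in estimator for illustration, not the Parzen-window estimator used by PWFS:

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for paired discrete samples."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# A perfectly predictive feature: I(X; Y) equals H(Y) = 1 bit.
feature = ["a", "a", "b", "b"]
label = [0, 0, 1, 1]
print(mutual_information(feature, label))  # 1.0
```

An independent feature gives an MI of zero, so ranking features by this quantity favors those most informative about the class.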
We also observed that conventional centroid-based clustering algorithms produce varying degrees of inaccurate clustering on heterogeneous data. This is because the Hamming distance used to measure the dissimilarity of non-numerical values does not provide optimal distances between different values, and problems arise when the Euclidean and Hamming distances are combined. We therefore extended the MI-based UFT to heterogeneous data clustering with the conventional k-means algorithm. For the original non-numerical features, UFT provides numerical values that preserve the structure of the original features while behaving as continuous values. Experiments and analysis of real-world benchmark datasets showed that the integrated UFT-k-means clustering algorithm outperformed other clustering methods on heterogeneous data by around 14%.
Meanwhile, most conventional classification methods are not applicable to heterogeneous data. A minimum probability error (MPE)-based feature transformation was therefore developed to transform non-numerical features into numerical ones, based on the information provided by a highly related numerical feature subset within the same dataset. In contrast to feature calibration and arbitrary value assignment, the proposed MPE transformation preserves the original structure of non-numerical features without introducing unreliable information. Simulations and analysis of real-world benchmark datasets showed that the proposed MPE outperformed other conventional feature transformation methods in classification accuracy by around 4%.
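The general idea of deriving numeric codes from a related numerical feature can be sketched with a simple conditional-mean encoding. This is only an analogue of the idea, not the thesis's MPE criterion, and the dataset below is invented:

```python
from collections import defaultdict

# Illustrative stand-in: each category of a non-numerical feature is
# replaced by the mean of a related numerical feature over its samples.
# NOTE: this is NOT the MPE transformation itself, only a simple analogue.
city = ["HK", "HK", "SZ", "SZ", "SZ"]           # non-numerical feature
rent = [120.0, 140.0, 60.0, 70.0, 80.0]         # related numerical feature

sums = defaultdict(float)
counts = defaultdict(int)
for c, r in zip(city, rent):
    sums[c] += r
    counts[c] += 1

encoding = {c: sums[c] / counts[c] for c in sums}
numeric_city = [encoding[c] for c in city]
print(encoding)      # {'HK': 130.0, 'SZ': 70.0}
print(numeric_city)  # [130.0, 130.0, 70.0, 70.0, 70.0]
```

Categories that behave similarly on the related numerical feature end up numerically close, so the transformed feature can feed any standard numerical classifier.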
Thus, the proposed UFT and MPE are promising feature transformation methods for heterogeneous feature subset selection, clustering and classification, respectively, and are convincing choices for large-scale real-world applications.