The impact of the distance metric and measure on SMOTE-based techniques in software defect prediction

Research output: Journal Publications and Reviews (RGC: 21, 22, 62) › 21_Publication in refereed journal › peer-review

Detail(s)

Original language: English
Article number: 106742
Journal / Publication: Information and Software Technology
Volume: 142
Online published: 16 Oct 2021
Publication status: Published - Feb 2022

Abstract

Context: In software defect prediction, SMOTE-based techniques are widely adopted to alleviate the class imbalance problem. These techniques synthesize minority-class instances from existing instances that are close to one another, which keeps the number of generated noise instances low.
Objective: Recent studies, however, show that selecting instances that are farther apart increases diversity and alleviates the overgeneralization caused by SMOTE-based techniques. This study investigates the relationship between the distance of the selected instances and the performance of SMOTE-based techniques.
Method: We first conduct experiments to empirically investigate the impact of the distance between instances on the performance of three common SMOTE-based techniques. Based on the experimental results, we improve a recently proposed oversampling technique, SMOTUNED.
Results: The experimental results on five common classifiers across 30 imbalanced datasets from the PROMISE repository show that (1) the choice of distance metric has little impact on the performance of SMOTE-based techniques, (2) as long as the number of synthesized noise instances does not exceed the noise-resistant ability of the classifiers, the overall performance of SMOTE-based techniques, measured by AUC and balance, is not significantly affected by the distance between instances, and (3) the probability of detection (pd) and the probability of false alarm (pf) values of SMOTE-based techniques are significantly affected by the distance between the selected instances. The larger the distance between the selected instances, the lower the pd and pf values that SMOTE-based techniques obtain. The improved SMOTUNED performs similarly to the original SMOTUNED but dramatically reduces its execution time.
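For readers unfamiliar with the measures named in the results, the following is a minimal sketch of how pd, pf, and balance are commonly computed from a confusion matrix in the defect prediction literature. The abstract does not state the formulas, so the balance definition used here (normalized distance from the ideal point pd = 1, pf = 0) is an assumption, and the function name `pd_pf_balance` is hypothetical.

```python
import math

def pd_pf_balance(tp, fn, fp, tn):
    """Return (pd, pf, balance) from confusion-matrix counts.

    Assumed definitions (not given in the abstract):
      pd      = TP / (TP + FN)   probability of detection (recall)
      pf      = FP / (FP + TN)   probability of false alarm
      balance = 1 - sqrt((0 - pf)^2 + (1 - pd)^2) / sqrt(2)
    """
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    # Euclidean distance from the ideal point (pf=0, pd=1), scaled to [0, 1].
    balance = 1.0 - math.sqrt((0.0 - pf) ** 2 + (1.0 - pd) ** 2) / math.sqrt(2.0)
    return pd, pf, balance

# Example: 8 of 10 defective modules found, 10 of 100 clean modules flagged.
pd, pf, balance = pd_pf_balance(tp=8, fn=2, fp=10, tn=90)
print(f"pd={pd:.2f} pf={pf:.2f} balance={balance:.3f}")
```

Under these definitions, lowering pd and pf together moves a classifier along the ROC curve, so balance and AUC can change little even when pd and pf both shift.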
Conclusion: By controlling the distance, different pd and pf values can be obtained. The diversity of SMOTE-based techniques can be improved, and the overgeneralization can be avoided.
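The idea of controlling the distance between selected instances can be illustrated with a minimal SMOTE-style sketch. This is not the implementation evaluated in the paper or the SMOTUNED algorithm: the function names, the `rank_lo`/`rank_hi` parameters for choosing nearer or farther neighbors, and the use of the Minkowski exponent `p` as the distance-metric knob are all assumptions made for illustration.

```python
import random

def minkowski(a, b, p=2.0):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def synthesize(minority, n_new, rank_lo=0, rank_hi=5, p=2.0, seed=0):
    """Generate n_new synthetic minority instances by interpolation.

    For each new instance, a base minority instance is picked at random and
    paired with one of its neighbors whose distance rank falls in
    [rank_lo, rank_hi) (rank 0 = nearest). Raising rank_lo selects
    instances that are farther away (hypothetical parameterization).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank all other minority instances by distance to the base.
        others = sorted(
            (m for m in minority if m is not base),
            key=lambda m: minkowski(base, m, p),
        )
        candidates = others[rank_lo:rank_hi]
        if not candidates:
            raise ValueError("rank window selects no neighbors")
        neighbor = rng.choice(candidates)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
            [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
near = synthesize(minority, 10, rank_lo=0, rank_hi=2)  # close neighbors
far = synthesize(minority, 10, rank_lo=3, rank_hi=5)   # distant neighbors
```

Because each synthetic point lies on the segment between two existing minority instances, a wider rank window spreads the new points over a larger region, which is the diversity-versus-noise trade-off the abstract describes.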

Research Area(s)

  • Class imbalance, Defect prediction, Distance metric, Empirical software engineering, Synthetic minority oversampling TEchnique