Investigation on the stability of SMOTE-based oversampling techniques in software defect prediction

Research output: Journal Publications and Reviews (RGC: 21, 22, 62)21_Publication in refereed journalpeer-review

View graph of relations

Related Research Unit(s)

Detail(s)

Original languageEnglish
Article number106662
Journal / PublicationInformation and Software Technology
Volume139
Online published17 Jun 2021
Publication statusOnline published - 17 Jun 2021

Abstract

Context: In practice, software datasets tend to have more non-defective instances than defective ones, which is referred to as the class imbalance problem in software defect prediction (SDP). Synthetic Minority Oversampling TEchnique (SMOTE) and its variants alleviate the class imbalance problem by generating synthetic defective instances. SMOTE-based oversampling techniques were widely adopted as the baselines to compare with the newly proposed oversampling techniques in SDP. However, randomness is introduced during the procedure of SMOTE-based oversampling techniques. If the performance of SMOTE-based oversampling techniques is highly unstable, the conclusion drawn from the comparison between SMOTE-based oversampling techniques and the newly proposed techniques may be misleading and less convincing. 
Objective:
This paper aims to investigate the stability of SMOTE-based oversampling techniques. Moreover, a series of stable SMOTE-based oversampling techniques are proposed to improve the stability of SMOTE-based oversampling techniques. 
Method:
Stable SMOTE-based oversampling techniques reduce the randomness in each step of SMOTE-based oversampling techniques by selecting defective instances in turn, distance-based selection of K neighbor instances, and evenly distributed interpolation. Besides, we mathematically prove and also empirically investigate the stability of SMOTE-based and stable SMOTE-based oversampling techniques on four common classifiers across 26 datasets in terms of AUC, balance, and MCC. 
Results:
The analysis of SMOTE-based and stable SMOTE-based oversampling techniques shows that the performance of stable SMOTE-based oversampling techniques is more stable and better than that of SMOTE-based oversampling techniques. The difference between the worst and best performances of SMOTE-based oversampling techniques is up to 23.3%, 32.6%, and 204.2% in terms of AUC, balance, and MCC, respectively. 
Conclusion:
Stable SMOTE-based oversampling techniques should be considered as a drop-in replacement for SMOTE-based oversampling techniques.

Research Area(s)

  • Class imbalance, Empirical Software Engineering, Oversampling, SMOTE, Software defect prediction