TY - JOUR
T1 - On the relative value of imbalanced learning for code smell detection
AU - Li, Fuyang
AU - Zou, Kuan
AU - Keung, Jacky Wai
AU - Yu, Xiao
AU - Feng, Shuo
AU - Xiao, Yan
PY - 2023/10
Y1 - 2023/10
N2 - Machine learning-based code smell detection (CSD) has been demonstrated to be a valuable approach for improving software quality and enabling developers to identify problematic patterns in code. However, previous researches have shown that the code smell datasets commonly used to train these models are heavily imbalanced. While some recent studies have explored the use of imbalanced learning techniques for CSD, they have only evaluated a limited number of techniques and thus their conclusions about the most effective methods may be biased and inconclusive. To thoroughly evaluate the effect of imbalanced learning techniques for machine learning-based CSD, we examine 31 imbalanced learning techniques with seven classifiers to build CSD models on four code smell data sets. We employ four evaluation metrics to assess the detection performance with the Wilcoxon signed-rank test and Cliff's δ. The results show that (1) Not all imbalanced learning techniques significantly improve detection performance, but deep forest significantly outperforms the other techniques on all code smell data sets. (2) SMOTE (Synthetic Minority Over-sampling TEchnique) is not the most effective technique for resampling code smell data sets. (3) The best-performing imbalanced learning techniques and the top-3 data resampling techniques have little time cost for code smell detection. Therefore, we provide some practical guidelines. First, researchers and practitioners should select the appropriate imbalanced learning techniques (e.g., deep forest) to ameliorate the class imbalance problem. In contrast, the blind application of imbalanced learning techniques could be harmful. Then, better data resampling techniques than SMOTE should be selected to preprocess the code smell data sets. © 2023 John Wiley & Sons, Ltd.
AB - Machine learning-based code smell detection (CSD) has been demonstrated to be a valuable approach for improving software quality and enabling developers to identify problematic patterns in code. However, previous researches have shown that the code smell datasets commonly used to train these models are heavily imbalanced. While some recent studies have explored the use of imbalanced learning techniques for CSD, they have only evaluated a limited number of techniques and thus their conclusions about the most effective methods may be biased and inconclusive. To thoroughly evaluate the effect of imbalanced learning techniques for machine learning-based CSD, we examine 31 imbalanced learning techniques with seven classifiers to build CSD models on four code smell data sets. We employ four evaluation metrics to assess the detection performance with the Wilcoxon signed-rank test and Cliff's δ. The results show that (1) Not all imbalanced learning techniques significantly improve detection performance, but deep forest significantly outperforms the other techniques on all code smell data sets. (2) SMOTE (Synthetic Minority Over-sampling TEchnique) is not the most effective technique for resampling code smell data sets. (3) The best-performing imbalanced learning techniques and the top-3 data resampling techniques have little time cost for code smell detection. Therefore, we provide some practical guidelines. First, researchers and practitioners should select the appropriate imbalanced learning techniques (e.g., deep forest) to ameliorate the class imbalance problem. In contrast, the blind application of imbalanced learning techniques could be harmful. Then, better data resampling techniques than SMOTE should be selected to preprocess the code smell data sets. © 2023 John Wiley & Sons, Ltd.
KW - code smell detection
KW - empirical software engineering
KW - imbalanced learning
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85162929161&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85162929161&origin=recordpage
U2 - 10.1002/spe.3235
DO - 10.1002/spe.3235
M3 - RGC 21 - Publication in refereed journal
SN - 0038-0644
VL - 53
SP - 1902
EP - 1927
JO - Software - Practice and Experience
JF - Software - Practice and Experience
IS - 10
ER -