Improving Software Defect Prediction by Developing Techniques for Alleviating the Class Imbalance and the Class Overlap Problems


Student thesis: Doctoral Thesis





Award date: 10 Aug 2021


Software defect prediction (SDP) trains machine learning classifiers on historical software instances (e.g., files, packages, or functions) and uses the resulting prediction models to predict whether instances introduced in the future are defective. However, practical issues such as the class imbalance problem and the class overlap problem significantly hinder the performance of prediction models in SDP. The class imbalance problem arises because there are generally more non-defective instances than defective ones, so prediction models focus on the non-defective instances and ignore the defective ones. The class overlap problem arises because defective and non-defective instances are mixed together in the feature space, which makes it difficult for prediction models to distinguish defective instances from non-defective ones.

Class resampling is the most common way to alleviate the class imbalance problem in SDP because it is independent of the prediction model and easy to employ. Class resampling techniques fall into two general types: undersampling and oversampling. Undersampling removes non-defective instances from the original dataset, while oversampling adds synthetic defective instances to it; both aim to balance the two classes. Oversampling is more popular than undersampling in SDP. According to our previous study, the overall performance of the existing oversampling techniques is similar. However, these techniques generate synthetic instances from instances that are either too close together or too far apart, which leads to either a high probability of false alarm (pf) or a low probability of detection (pd). We conduct an empirical study to investigate the impact of distance on the performance of oversampling techniques. Based on the results of that study, we propose a novel oversampling technique, the complexity-based oversampling technique (COSTE), that can simultaneously achieve low pf and high pd values. The experimental results show that COSTE greatly improves the diversity of the synthetic instances without compromising the ability of prediction models to find defects. The experimental results are further validated by the Wilcoxon signed-rank test and Cliff’s δ effect size.
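To make the terms above concrete, the following is a minimal sketch of how pd and pf are computed from predictions and how an interpolation-based oversampler creates a synthetic defective instance between two existing ones. It is an illustration under stated assumptions, not COSTE: the function names are hypothetical, labels are assumed to be 1 for defective and 0 for non-defective, and the pair of instances is picked at random here, whereas choosing which instances to pair (and how far apart they are) is exactly the design decision the thesis studies.

```python
import random


def pd_pf(y_true, y_pred):
    """Return (pd, pf) for binary labels: 1 = defective, 0 = non-defective."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pd = tp / (tp + fn) if tp + fn else 0.0  # probability of detection (recall)
    pf = fp / (fp + tn) if fp + tn else 0.0  # probability of false alarm
    return pd, pf


def interpolate(a, b, rng):
    """Create one synthetic instance on the line segment between a and b."""
    gap = rng.random()  # uniform in [0, 1)
    return [x + gap * (y - x) for x, y in zip(a, b)]


def oversample(defective, n_new, seed=0):
    """Generate n_new synthetic defective instances.

    Assumption: pairs are drawn at random. Real techniques select the
    pair by some distance-based rule, which this sketch does not model.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(defective, 2)
        synthetic.append(interpolate(a, b, rng))
    return synthetic
```

For example, `oversample([[0.0, 0.0], [2.0, 2.0]], n_new=3)` returns three points that all lie on the segment between the two defective instances; the closer the paired instances are, the less the synthetic points differ from the originals.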

However, according to the previous study, random undersampling (RUS), the simplest undersampling technique, outperforms several state-of-the-art oversampling techniques in SDP. Although RUS performs well, it removes non-defective instances at random, so important information that those instances provide to prediction models may be lost. Therefore, we propose a novel undersampling technique, the learning-to-rank undersampling technique (LTRUS). LTRUS ranks non-defective instances by their importance and removes the less important ones, which preserves as much information as possible for prediction models. LTRUS significantly outperforms several class resampling techniques, such as RUS and the Synthetic Minority Oversampling TEchnique (SMOTE), in terms of AUC and balance.
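The contrast between random and ranking-based undersampling can be sketched as follows. This is a hedged illustration, not LTRUS itself: the `score` argument is a hypothetical placeholder for whatever importance measure a learned ranker would supply, and the function names are assumptions. The `balance` helper uses the definition commonly found in the SDP literature, one minus the normalized Euclidean distance from the ideal point (pf = 0, pd = 1).

```python
import math
import random


def random_undersample(non_defective, n_keep, seed=0):
    """RUS: keep n_keep non-defective instances chosen uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(non_defective, n_keep)


def ranked_undersample(non_defective, n_keep, score):
    """Keep the n_keep instances with the highest importance score.

    Assumption: `score` is a hypothetical callable that maps an instance
    to a number; how importance is learned is not modeled here.
    """
    ranked = sorted(non_defective, key=score, reverse=True)
    return ranked[:n_keep]


def balance(pd, pf):
    """Balance metric: 1 - distance to the ideal point (pf=0, pd=1) / sqrt(2)."""
    return 1.0 - math.sqrt((0.0 - pf) ** 2 + (1.0 - pd) ** 2) / math.sqrt(2.0)
```

With `score=lambda row: row[0]`, `ranked_undersample([[1], [3], [2]], 2, score)` keeps `[[3], [2]]`, whereas `random_undersample` may discard any of the three rows regardless of how informative it is.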

Although class resampling techniques perform well, several studies have pointed out that the class overlap problem is primarily responsible for the degradation in the performance of prediction models. Given this conclusion, if existing class overlap cleaning techniques do not perform better than existing resampling techniques, the current class overlap cleaning techniques are not effective enough. In addition, the performance of current class overlap cleaning techniques relies heavily on appropriate hyperparameter settings, and choosing the optimal hyperparameters remains a challenge. Therefore, we propose a novel class overlap cleaning technique, the radius-based class overlap cleaning technique (ROCT), which significantly improves the performance of class overlap cleaning without hyperparameter tuning. Furthermore, ROCT significantly outperforms several state-of-the-art class resampling techniques, which empirically confirms that the class overlap problem plays a significant role in the performance of prediction models.
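As a rough illustration of what class overlap cleaning does, the sketch below drops non-defective instances whose nearest neighbors are mostly defective. This is a generic neighborhood-based cleaner written for illustration, not ROCT; the function name and the majority rule are assumptions. Note that it depends on the neighborhood size `k`, which is the kind of hyperparameter that the paragraph above describes as hard to choose.

```python
import math


def clean_overlap(X, y, k=3):
    """Remove non-defective instances that sit among defective ones.

    X is a list of feature vectors; y holds 1 (defective) or 0
    (non-defective). Defective instances are always kept.
    """
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi == 0:
            # Indices of the k nearest other instances by Euclidean distance.
            others = (j for j in range(len(X)) if j != i)
            neighbors = sorted(others, key=lambda j: math.dist(xi, X[j]))[:k]
            n_defective = sum(y[j] for j in neighbors)
            if n_defective * 2 > len(neighbors):
                continue  # overlapping non-defective instance: drop it
        keep_X.append(xi)
        keep_y.append(yi)
    return keep_X, keep_y
```

A non-defective instance located inside a cluster of defective instances is removed, while non-defective instances in their own region are kept; changing `k` can change which instances count as overlapping.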

In conclusion, this thesis aims to preprocess data properly so that effective prediction models can be built in SDP. Such models can guide the allocation of limited testing resources to more defect-prone instances and lead to better software quality.

    Research areas

  • data preprocessing, software defect prediction, class imbalance problem, class overlap problem, class resampling