Comprehensive Analysis of Data Sampling Approaches on the Performance of Software Defect Prediction Models


Student thesis: Doctoral Thesis

Award date: 16 Aug 2018


Software defect prediction models identify modules that are likely to be faulty and thereby help prioritize scarce testing resources. The performance of these models is known to depend on the datasets used to train them. Unfortunately, defect datasets are imbalanced, with non-defective modules outnumbering defective ones, which makes accurate and reliable prediction difficult. A prevalent way to alleviate this class imbalance before model training is to apply data sampling approaches. Previous studies have shown that applying data sampling to imbalanced defect data affects the performance of defect prediction models. However, very little is known about the best class distribution for attaining high performance, especially in more practical scenarios. Results remain inconclusive regarding the suitable ratio of defective to clean instances (the resampling rate), the statistical and practical impact of resampling approaches on prediction performance, and which resampling approach is most stable across several performance measures. Furthermore, previous studies have mostly evaluated sampling methods in terms of accuracy, geometric mean, or F1-measure; these measures do not consider the effort needed to fix faults.
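To make the notion of a resampling rate concrete, the following is a minimal sketch of plain random over-sampling, not a method taken from the thesis. It assumes labels are 1 for defective and 0 for clean, and it interprets the resampling rate as the target fraction of defective instances in the resampled data; the function name `random_oversample` and the parameter `target_ratio` are hypothetical.

```python
import random


def random_oversample(rows, labels, target_ratio, seed=0):
    """Duplicate randomly chosen defective rows until they make up at least
    `target_ratio` of all instances (assumed meaning of the resampling rate)."""
    if not 0 < target_ratio < 1:
        raise ValueError("target_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    defective = [r for r, y in zip(rows, labels) if y == 1]
    if not defective:
        raise ValueError("no defective instances to sample from")
    n_clean = sum(1 for y in labels if y == 0)
    n_defective = len(defective)
    out_rows, out_labels = list(rows), list(labels)
    # Append duplicates of defective rows until the target fraction is reached.
    while n_defective / (n_defective + n_clean) < target_ratio:
        out_rows.append(rng.choice(defective))
        out_labels.append(1)
        n_defective += 1
    return out_rows, out_labels


# Toy data: 2 defective and 8 clean modules, one metric per module.
rows = [[i] for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
balanced_rows, balanced_labels = random_oversample(rows, labels, target_ratio=0.5)
```

Because this sketch only copies existing defective rows, it produces exactly the kind of duplicated instances that the abstract identifies as a weakness of common sampling approaches.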

In this thesis, we investigate the impact of sampling methods on prediction performance. We conduct an in-depth analysis of how sampling methods and their components (i.e., different types and configuration parameters) affect the performance of defect prediction models. A case study on open-source projects with both static and process metrics reveals that (1) the effectiveness of sampling methods depends on the resampling rate; (2) sampling methods significantly improve performance, but the improvement is accompanied by significantly high false alarm rates; and (3) sampling methods, specifically under-sampling, do not improve prediction performance when evaluated with an effort-aware measure. Our findings suggest that, for better prediction performance, researchers should carefully select the resampling rate based on the sampling approach. Motivated by our findings that most data sampling approaches result in over-generalization (high rates of false alarms) and generate near-duplicate data instances (less diverse data), we further propose a diversity-based oversampling method named MAHAKIL, which aims to produce high recall (probability of detection) and low pf (false alarms). Our experiments indicate that MAHAKIL improves prediction performance for all the models and achieves better and more significant pf values than the other oversampling approaches, based on Brunner's statistical significance test and Cliff's effect sizes. Therefore, MAHAKIL is strongly recommended as an efficient alternative for defect prediction models built on highly imbalanced datasets.
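The two measures named above, recall (probability of detection) and pf (false alarms), can be computed from a confusion matrix as in the sketch below. It uses the standard definitions recall = TP / (TP + FN) and pf = FP / (FP + TN) and assumes labels are 1 for defective and 0 for clean; the function name `recall_and_pf` is hypothetical, and the sketch does not implement MAHAKIL or the effort-aware measure used in the thesis.

```python
def recall_and_pf(actual, predicted):
    """Return (recall, pf) for binary labels where 1 = defective, 0 = clean."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1  # defective module correctly flagged
        elif a == 1 and p == 0:
            fn += 1  # defective module missed
        elif a == 0 and p == 1:
            fp += 1  # clean module flagged: a false alarm
        else:
            tn += 1  # clean module correctly passed
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, pf


# Toy example: 2 defective and 4 clean modules.
actual = [1, 1, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 0]
recall, pf = recall_and_pf(actual, predicted)
```

A sampling method that raises recall by flagging more modules as defective will also tend to raise pf, which is why the two measures are reported together.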

    Research areas

  • Software Defect Prediction Modelling, Imbalanced data, Data Sampling Techniques, Empirical software engineering