Comprehensive Investigation of Concept Drift on the Performance of Software Defect Prediction Models
軟體缺陷預測模型性能概念漂移的綜合研究
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors |
|
Award date | 10 Aug 2021 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(81c11226-c77f-4869-8be4-3ac6876fe250).html |
---|---|
Abstract
Concept drift (CD) is a phenomenon in which the data distribution of a dataset shifts over time. It degrades the performance of prediction models in software engineering (SE), such as software defect prediction (SDP) models. Detecting CD in SE datasets and improving the prediction performance of models affected by CD is a challenge. Over time, a long-running software project goes through several versions, producing high volumes of data with a non-stationary distribution. Furthermore, the relationship between data variables changes (i.e., concept drift). We can assume that the data of the project versions appear in the form of temporal data distributions. Predicting defects in such a project is challenging.
In SDP, prediction models are built on historical data and used to predict defects in upcoming projects, on which the models are tested. The performance of a prediction model is known to depend on the datasets used to train it. To improve the information content of training datasets that appear over time, the impact of concept drift should be considered, yet it has not been extensively investigated. Among SDP techniques, cross-version defect prediction (CVDP) trains a model on data from the previous version of a software project and tests it on the current version of the same project, a setting considered more practical and realistic. In this setting, the project is updated version by version, and the versions keep their chronological order. The impact of concept drift on the performance of classification-based SDP models has never been explored over chronological defect datasets. In addition, defect datasets are imbalanced, with non-defective modules outnumbering defective ones. Prior research has shown that class rebalancing techniques, which modify the distribution of the training datasets, improve the prediction performance of defect prediction models. Because class rebalancing techniques try to produce a similar representation of the classes, they may eliminate drift and affect the performance of classification-based SDP models, which has not yet been thoroughly investigated. Furthermore, prior work shows that feature selection techniques are effective at improving SDP performance. Researchers have attributed fluctuations in prediction results to changes in the data features. Therefore, feature selection techniques may avoid CD because the selected features vary across versions of the software project, which has not yet been examined.
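As an illustration of what a class rebalancing technique does to a training version, the following is a minimal sketch of random oversampling. It is one simple rebalancing technique chosen for illustration; the abstract does not say which techniques the thesis evaluates. The function name `random_oversample` and the toy metric values are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (illustrative only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    largest = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows of this class that may be duplicated.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(largest - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy training version: 1 = defective, 0 = non-defective.
# The two columns are made-up module metrics, not data from the thesis.
train_rows = [[120, 4], [300, 9], [80, 2], [45, 1], [60, 1], [500, 14]]
train_labels = [0, 0, 0, 0, 0, 1]

balanced_rows, balanced_labels = random_oversample(train_rows, train_labels)
print(Counter(balanced_labels))
```

Only the training version is resampled here; the test version keeps its original class distribution, so any change in measured performance comes from the altered training distribution.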
In this thesis, we investigate the impact of CD on SDP performance. We conduct a systematic investigation to examine whether class rebalancing can help eliminate CD from chronological defect datasets. To deal with CD while accounting for class imbalance, we propose a novel concept drift detection (CDD) framework that detects CD in cross-version (CV) defect datasets, and we investigate the feasibility of alleviating CD in CVDP models using class rebalancing techniques. The framework consists of four steps: the first pre-processes the CV defect datasets and forms a CV data stream, the second constructs the CV defect models, the third calculates the test statistics, and the fourth applies a hypothesis-test-based CD detection method. We demonstrate the CDD framework through experiments on 36 versions of 10 open-source software projects. We found that 50% of the CV defect datasets are drift-prone. Additionally, the class rebalancing techniques had a positive impact on CVDP prediction performance by correctly classifying the CV defective modules, and detected CD by up to 31% on the resampled datasets. We observed that class rebalancing techniques are beneficial when practitioners wish to increase the ability to correctly classify the CV defective modules. In addition, we formulate CVDP as a temporal chunk-based learning problem in which the data of different versions of a software project appear at each time step. To assess the impact of CD when feature selection techniques are used in chunk-based temporal learning to improve CVDP models, we investigated whether applying feature selection techniques separately to each version of the project when building classification-based CVDP models could improve prediction performance and the robustness of the trained CVDP models to CD.
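The last two steps of such a framework (test statistic, then hypothesis test) can be sketched as follows. This is a sketch under stated assumptions, not the thesis's method: the abstract does not name the test statistic, so a two-sided two-proportion z-test on misclassification rates of consecutive versions is swapped in. The function names, the significance level `alpha=0.05`, and the error counts are all hypothetical.

```python
import math


def two_proportion_z_test(err_a, n_a, err_b, n_b):
    """Two-sided z-test for a difference between two error proportions.
    Returns (z, p_value)."""
    p_a, p_b = err_a / n_a, err_b / n_b
    pooled = (err_a + err_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No errors at all, or only errors, in both samples: no evidence of change.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def detect_drift(version_errors, alpha=0.05):
    """version_errors: chronological list of (version, misclassified, total).
    Flags each consecutive pair whose error rates differ significantly."""
    flags = []
    for prev, cur in zip(version_errors, version_errors[1:]):
        _, p = two_proportion_z_test(prev[1], prev[2], cur[1], cur[2])
        flags.append((prev[0], cur[0], p < alpha))
    return flags


# Hypothetical cross-version stream: misclassified modules per version,
# as a model trained on earlier data might produce. Not data from the thesis.
stream = [("v1", 20, 200), ("v2", 22, 200), ("v3", 60, 200)]
drift_flags = detect_drift(stream)
print(drift_flags)
```

In this toy stream the error rate barely moves from v1 to v2 but jumps from v2 to v3, so only the second pair is flagged as drifting.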
The Friedman test and the Nemenyi post-hoc test indicate statistically significant differences between the prediction results with and without feature selection techniques when evaluated with AUC, Recall, and pf (probability of false alarm). CVDP models trained on the most recent versions are not always the best cross-version defect predictors. When the prediction models were updated by selecting features from each version, the feature selection techniques could help reduce drift in the defect datasets.
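For readers unfamiliar with the three evaluation measures named above, the following minimal sketch computes them from their standard definitions. Recall is TP / (TP + FN), pf is FP / (FP + TN), and AUC is the probability that a randomly chosen defective module receives a higher score than a randomly chosen non-defective one. The function names and the toy labels and scores are hypothetical, not results from the thesis.

```python
def recall_and_pf(y_true, y_pred):
    """Recall = TP / (TP + FN); pf (probability of false alarm) = FP / (FP + TN).
    Labels: 1 = defective, 0 = non-defective."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    pf = fp / (fp + tn) if fp + tn else 0.0
    return recall, pf


def auc(y_true, scores):
    """AUC as the fraction of (defective, non-defective) pairs in which the
    defective module gets the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one module of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions for one test version.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
recall, pf = recall_and_pf(y_true, y_pred)

toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(recall, pf, toy_auc)
```

Higher Recall and AUC are better, while lower pf is better, which is why the three are usually reported together.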
- Software Defect Prediction, Concept Drift, Class Rebalancing Techniques, Feature Selection Techniques, Empirical software engineering