TY - GEN
T1 - ROCT
T2 - 45th IEEE Annual Computers, Software, and Applications Conference (COMPSAC 2021)
AU - Feng, Shuo
AU - Keung, Jacky
AU - Liu, Jie
AU - Xiao, Yan
AU - Yu, Xiao
AU - Zhang, Miao
PY - 2021
Y1 - 2021
N2 - The training data commonly used in software defect prediction (SDP) usually contains some instances that have similar feature values but belong to different classes, which significantly degrades the performance of prediction models trained using these instances. This is referred to as the class overlap problem (COP). Previous studies have concluded that COP has a more negative impact on the performance of prediction models than the class imbalance problem (CIP). However, less research has been conducted on COP than on CIP. Moreover, the performance of the existing class overlap cleaning techniques relies heavily on hyperparameter settings such as the value of K in the K-nearest neighbor algorithm or the K-means algorithm, and finding the optimal hyperparameters remains a challenge. In this study, we propose a novel technique named the radius-based class overlap cleaning technique (ROCT) to better alleviate COP without tuning hyperparameters in SDP. The basic idea of ROCT is to take each instance as the center of a hypersphere and directly optimize the radius of the hypersphere. ROCT then identifies the instances inside the hypersphere whose label is opposite to that of the center instance as overlapping instances and removes them. To investigate the performance of ROCT, we conduct an empirical experiment across 29 datasets collected from various software repositories, using the K-nearest neighbor, random forest, logistic regression, and naive Bayes classifiers, measured by AUC, balance, pd, and pf. The experimental results show that ROCT performs best and significantly improves the performance of prediction models by as much as 15.2% and 29.9% in terms of AUC and balance, respectively, compared with the existing class overlap cleaning techniques. The superior performance of ROCT indicates that it should be recommended as an efficient alternative for alleviating COP in SDP. ©2021 IEEE.
KW - Class imbalance
KW - Class overlap
KW - Data preprocessing
KW - Software defect prediction
UR - http://www.scopus.com/inward/record.url?scp=85115839204&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85115839204&origin=recordpage
U2 - 10.1109/COMPSAC51774.2021.00041
DO - 10.1109/COMPSAC51774.2021.00041
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 9781665424639
T3 - Proceedings - IEEE Annual Computers, Software, and Applications Conference, COMPSAC
SP - 228
EP - 237
BT - Proceedings - 2021 IEEE 45th Annual Computers, Software, and Applications Conference, COMPSAC 2021
PB - IEEE
Y2 - 12 July 2021 through 16 July 2021
ER -