TY - GEN
T1 - The Impact of the bug number on Effort-Aware Defect Prediction
T2 - 14th Asia-Pacific Symposium on Internetware (Internetware 2023)
AU - Yang, Peixin
AU - Zhu, Lin
AU - Hu, Wenhua
AU - Keung, Jacky Wai
AU - Lu, Liping
AU - Xiang, Jianwen
PY - 2023
Y1 - 2023
N2 - Previous research has utilized public software defect datasets such as NASA, RELINK, and SOFTLAB, which contain only class label information. Almost all Effort-Aware Defect Prediction (EADP) studies are carried out on these datasets. However, EADP studies typically rely on bug density (i.e., the ratio between the number of bugs and the lines of code) to rank software modules. To investigate the impact of neglecting bug number information in software defect datasets on the performance of EADP models, we examine the performance degradation of the best-performing learning-to-rank methods when class labels are used instead of bug numbers. The experimental results show that neglecting bug number information in building EADP models results in an increase in the detected bugs. However, it also leads to a significant increase in the initial false alarms, ranging from 45.5% to 90.9% of the datasets, and a significant increase in the modules that need to be inspected, ranging from 5.2% to 70.4%. Therefore, we recommend that not only the class labels but also the bug number information be disclosed when publishing software defect datasets, in order to construct more accurate EADP models. © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
AB - Previous research has utilized public software defect datasets such as NASA, RELINK, and SOFTLAB, which contain only class label information. Almost all Effort-Aware Defect Prediction (EADP) studies are carried out on these datasets. However, EADP studies typically rely on bug density (i.e., the ratio between the number of bugs and the lines of code) to rank software modules. To investigate the impact of neglecting bug number information in software defect datasets on the performance of EADP models, we examine the performance degradation of the best-performing learning-to-rank methods when class labels are used instead of bug numbers. The experimental results show that neglecting bug number information in building EADP models results in an increase in the detected bugs. However, it also leads to a significant increase in the initial false alarms, ranging from 45.5% to 90.9% of the datasets, and a significant increase in the modules that need to be inspected, ranging from 5.2% to 70.4%. Therefore, we recommend that not only the class labels but also the bug number information be disclosed when publishing software defect datasets, in order to construct more accurate EADP models. © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
KW - Bug Number
KW - Effort-Aware
KW - Learning to Rank
KW - Software Defect Prediction
UR - http://www.scopus.com/inward/record.url?scp=85175716778&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-85175716778&origin=recordpage
U2 - 10.1145/3609437.3609458
DO - 10.1145/3609437.3609458
M3 - RGC 32 - Refereed conference paper (with host publication)
SN - 9798400708947
T3 - ACM International Conference Proceeding Series
SP - 67
EP - 78
BT - 14th Asia-Pacific Symposium on Internetware (Internetware 2023) - Proceedings
PB - Association for Computing Machinery
Y2 - 4 August 2023 through 6 August 2023
ER -