CASMS: Combining clustering with attention semantic model for identifying security bug reports

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

17 Scopus Citations

Detail(s)

Original language: English
Article number: 106906
Journal / Publication: Information and Software Technology
Volume: 147
Online published: 26 Mar 2022
Publication status: Published - Jul 2022

Abstract

Context: Inappropriate public disclosure of security bug reports (SBRs) is likely to attract malicious attackers to software systems; hence, detecting SBRs has become increasingly important for software maintenance. Existing techniques for identifying SBRs remain unsatisfactory because of three problems: class imbalance (non-security bug reports, NSBRs, outnumber SBRs), insufficient training information, and weak performance robustness.
Objective: We aim to overcome these challenges, which affect even the most advanced SBR detection methods.
Method: In this work, we propose the CASMS approach to efficiently alleviate the imbalance problem and classify bug reports. CASMS first converts bug reports into weighted word embeddings based on the tf-idf and word2vec techniques. Unlike previous studies, which select the NSBRs that are most dissimilar to SBRs, CASMS then automatically finds a certain number of diverse NSBRs via the Elbow method and the k-means clustering algorithm. Finally, the selected NSBRs and all SBRs are used to train an Attention CNN–BLSTM model that extracts contextual and sequential information.
Results: The experimental results show that CASMS outperforms the three baselines (i.e., FARSEC, SMOTUNED, and LTRWES) in overall performance (g-measure) and in correctly identifying SBRs (recall), with improvements of 4.09%–24.26% and 10.33%–36.24%, respectively. The best results are easily obtained under the limited ratio ranges of the two-class training set (1:1 to 3:1), with around 20 experiments for each project. Robustness, evaluated via the standard deviation indicator, shows that CASMS is more stable than LTRWES.
Conclusion: Overall, CASMS can alleviate the data imbalance problem and extract more semantic information to improve performance and robustness. Therefore, CASMS is recommended as a practical approach for identifying SBRs.
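
The first two steps of the Method (tf-idf weighted word embeddings, then Elbow plus k-means to pick diverse NSBRs) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. Several details are assumptions made for illustration: the toy 2-d vectors stand in for real word2vec output, the smoothed idf formula is one common variant, the elbow is located by the largest second difference of the inertia curve, and "diverse" is interpreted as taking the NSBR nearest each cluster centroid. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def tfidf_weighted_embedding(docs, word_vectors):
    """Average each document's word vectors, weighted by tf-idf."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    dim = len(next(iter(word_vectors.values())))
    embeddings = []
    for doc in docs:
        counts = Counter(tok for tok in doc if tok in word_vectors)
        vec, total = [0.0] * dim, 0.0
        for tok, count in counts.items():
            # Assumption: smoothed idf, so ubiquitous words keep a positive weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1
            weight = (count / len(doc)) * idf
            vec = [v + weight * w for v, w in zip(vec, word_vectors[tok])]
            total += weight
        embeddings.append([v / total for v in vec] if total else vec)
    return embeddings

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, inertia)."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia

def elbow_k(points, k_max):
    """Assumed elbow rule: k with the largest second difference of inertia (k_max >= 3)."""
    inertias = [kmeans(points, k)[2] for k in range(1, k_max + 1)]
    bends = [inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
             for i in range(1, len(inertias) - 1)]
    return bends.index(max(bends)) + 2

def select_diverse(points, k):
    """Assumed selection rule: index of the point nearest each centroid."""
    centroids, _, _ = kmeans(points, k)
    return sorted({min(range(len(points)), key=lambda i: sq_dist(points[i], c))
                   for c in centroids})

# Toy usage with made-up 2-d "word vectors" (real ones would come from word2vec).
toy_vectors = {"ui": [0.0, 1.0], "typo": [0.0, 1.0], "font": [0.0, 1.0],
               "crash": [1.0, 0.0], "freeze": [1.0, 0.0],
               "slow": [1.0, 1.0], "lag": [1.0, 1.0]}
nsbr_docs = [["ui", "typo"], ["font", "typo"], ["crash", "freeze"],
             ["crash", "crash"], ["slow", "lag"], ["lag", "slow"]]
nsbr_embeddings = tfidf_weighted_embedding(nsbr_docs, toy_vectors)
k = elbow_k(nsbr_embeddings, k_max=4)
selected = select_diverse(nsbr_embeddings, k)  # indices of the NSBRs to keep
```

In a setting like the one described, the selected NSBRs would be combined with all SBRs to form the training set for the neural model; that final step is omitted here because the abstract does not specify the Attention CNN–BLSTM architecture in enough detail to sketch it faithfully.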

Research Area(s)

  • Security bug report, Clustering, Hybrid neural networks