TY - JOUR
T1 - Less Is More
T2 - Unlocking Semi-Supervised Deep Learning for Vulnerability Detection
AU - YU, Xiao
AU - LIN, Guancheng
AU - HU, Xing
AU - KEUNG, Jacky Wai
AU - XIA, Xin
PY - 2025/3
Y1 - 2025/3
AB - Deep learning has demonstrated its effectiveness in software vulnerability detection, but acquiring a large number of labeled code snippets to train deep learning models is challenging due to labor-intensive annotation. With limited labeled data, complex deep learning models often suffer from overfitting and poor performance. To address this limitation, semi-supervised deep learning offers a promising approach: it annotates unlabeled code snippets with pseudo-labels and uses them, together with the limited labeled data, as the training set for vulnerability detection models. However, applying semi-supervised deep learning to accurate vulnerability detection comes with several challenges. One challenge lies in how to select correctly pseudo-labeled code snippets as training data; another involves mitigating the impact of incorrectly pseudo-labeled training code snippets during model training. To address these challenges, we propose the semi-supervised vulnerability detection (SSVD) approach. SSVD leverages the information gain of model parameters as a measure of the certainty that pseudo-labels are correct and prioritizes high-certainty pseudo-labeled code snippets as training data. Additionally, it incorporates a proposed noise-robust triplet loss, which maximizes the separation between vulnerable and non-vulnerable code snippets so that labels propagate better from labeled code snippets to nearby unlabeled ones, and a proposed noise-robust cross-entropy loss, which clips gradients to mitigate the error accumulation caused by incorrect pseudo-labels. We compare SSVD against nine semi-supervised approaches on four widely used public vulnerability datasets. The results demonstrate that SSVD outperforms the baselines by an average of 29.82% in F1-score and 56.72% in MCC. In addition, SSVD trained on a fraction of the labeled data outperforms or closely matches fully supervised LineVul and ReVeal vulnerability detection models trained on 100% of the labeled data in most scenarios. This indicates that SSVD can effectively learn from limited labeled data to enhance vulnerability detection performance, thereby reducing the effort required to label a large number of code snippets. © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
KW - Information Gain
KW - Semi-Supervised Learning
KW - Vulnerability Detection
UR - http://www.scopus.com/inward/record.url?scp=86000367475&partnerID=8YFLogxK
UR - https://www.scopus.com/record/pubmetrics.uri?eid=2-s2.0-86000367475&origin=recordpage
U2 - 10.1145/3699602
DO - 10.1145/3699602
M3 - RGC 21 - Publication in refereed journal
SN - 1049-331X
VL - 34
JO - ACM Transactions on Software Engineering and Methodology
JF - ACM Transactions on Software Engineering and Methodology
IS - 3
M1 - 62
ER -
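
The abstract above describes a self-training recipe: select pseudo-labeled snippets by certainty, pull apart vulnerable and non-vulnerable embeddings with a triplet-style loss, and clip gradients to limit damage from wrong pseudo-labels. The following is a minimal, hedged Python/PyTorch sketch of that generic recipe, not the authors' SSVD implementation: certainty is approximated here by softmax confidence rather than the paper's information-gain criterion, and all names (ToyDetector, select_pseudo_labeled, train_step) are hypothetical.

# Minimal self-training sketch in the spirit of the abstract (NOT the authors' SSVD code).
# Assumptions: certainty = softmax confidence (a stand-in for the paper's information-gain
# criterion); the separation loss is the standard margin-based triplet loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDetector(nn.Module):
    """Hypothetical encoder + binary classifier over code-snippet feature vectors."""
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU())
        self.head = nn.Linear(64, 2)  # vulnerable vs. non-vulnerable

    def forward(self, x):
        z = self.encoder(x)
        return self.head(z), z  # logits and embedding

def select_pseudo_labeled(model, unlabeled_x, threshold=0.95):
    """Keep only unlabeled snippets whose pseudo-label the model is highly certain about."""
    model.eval()
    with torch.no_grad():
        logits, _ = model(unlabeled_x)
        probs = F.softmax(logits, dim=1)
        certainty, pseudo_y = probs.max(dim=1)
    keep = certainty >= threshold
    return unlabeled_x[keep], pseudo_y[keep]

def train_step(model, optimizer, x, y, anchors, positives, negatives, max_grad_norm=1.0):
    """One update on labeled + selected pseudo-labeled data with cross-entropy + triplet losses."""
    model.train()
    logits, _ = model(x)
    ce = F.cross_entropy(logits, y)
    _, za = model(anchors)
    _, zp = model(positives)
    _, zn = model(negatives)
    triplet = nn.TripletMarginLoss(margin=1.0)(za, zp, zn)
    loss = ce + triplet
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping limits the damage from incorrectly pseudo-labeled examples.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()

A typical loop would alternate select_pseudo_labeled over the unlabeled pool with train_step over the union of labeled and selected data; the paper's contribution lies in how certainty is scored and how both losses are made robust to pseudo-label noise, which this sketch does not reproduce.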