Less Is More: Unlocking Semi-Supervised Deep Learning for Vulnerability Detection

Xiao YU, Guancheng LIN, Xing HU*, Jacky Wai KEUNG, Xin XIA

*Corresponding author for this work

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

4 Citations (Scopus)

Abstract

Deep learning has demonstrated its effectiveness in software vulnerability detection, but acquiring a large number of labeled code snippets to train deep learning models is challenging due to labor-intensive annotation. With limited labeled data, complex deep learning models often suffer from overfitting and poor performance. To address this limitation, semi-supervised deep learning offers a promising approach: it annotates unlabeled code snippets with pseudo-labels and combines them with the limited labeled data to form the training set for vulnerability detection models. However, applying semi-supervised deep learning to accurate vulnerability detection comes with several challenges. One challenge is how to select correctly pseudo-labeled code snippets as training data; another is how to mitigate the impact of potentially incorrectly pseudo-labeled training code snippets during model training. To address these challenges, we propose the semi-supervised vulnerability detection (SSVD) approach. SSVD leverages the information gain of model parameters as a measure of certainty in the correctness of pseudo-labels and prioritizes high-certainty pseudo-labeled code snippets as training data. It also incorporates a proposed noise-robust triplet loss, which maximizes the separation between vulnerable and non-vulnerable code snippets so that labels propagate more reliably from labeled code snippets to nearby unlabeled ones, and a proposed noise-robust cross-entropy loss with gradient clipping, which mitigates the error accumulation caused by incorrect pseudo-labels. We evaluate SSVD against nine semi-supervised approaches on four widely used public vulnerability datasets. The results demonstrate that SSVD outperforms the baselines by an average of 29.82% in F1-score and 56.72% in MCC. In addition, SSVD trained on a fraction of the labeled data outperforms or closely matches fully supervised LineVul and ReVeal vulnerability detection models trained on 100% labeled data in most scenarios. This indicates that SSVD can effectively learn from limited labeled data to enhance vulnerability detection performance, thereby reducing the effort required to label a large number of code snippets. © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
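
For orientation, the snippet below is a minimal, illustrative PyTorch sketch of the generic semi-supervised pattern the abstract describes: selecting high-certainty pseudo-labeled snippets and training with a gradient-clipped cross-entropy so that noisy pseudo-labels cannot dominate an update. It is not the authors' implementation; the certainty proxy used here (softmax confidence), the threshold value, and names such as select_pseudo_labeled and certainty_threshold are assumptions, whereas SSVD derives certainty from the information gain of model parameters and additionally uses a noise-robust triplet loss.

    # Illustrative sketch only (not SSVD's actual code). Assumes a generic
    # classifier `model` mapping input tensors to class logits, and an
    # iterable `unlabeled_loader` yielding batches of input tensors.
    import torch
    import torch.nn.functional as F

    def select_pseudo_labeled(model, unlabeled_loader, certainty_threshold=0.9):
        """Pseudo-label unlabeled snippets and keep only high-certainty ones.
        Certainty is approximated here by the predicted class probability;
        the paper instead uses the information gain of model parameters."""
        model.eval()
        kept_inputs, kept_labels = [], []
        with torch.no_grad():
            for inputs in unlabeled_loader:
                probs = F.softmax(model(inputs), dim=-1)
                confidence, pseudo_labels = probs.max(dim=-1)
                mask = confidence >= certainty_threshold
                if mask.any():
                    kept_inputs.append(inputs[mask])
                    kept_labels.append(pseudo_labels[mask])
        if not kept_inputs:
            return None, None
        return torch.cat(kept_inputs), torch.cat(kept_labels)

    def train_step(model, optimizer, inputs, labels, max_grad_norm=1.0):
        """One training step with cross-entropy and gradient-norm clipping,
        limiting how far a batch of mislabeled pseudo-labels can push the model."""
        model.train()
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs), labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        return loss.item()
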
Original language: English
Article number: 62
Journal: ACM Transactions on Software Engineering and Methodology
Volume: 34
Issue number: 3
Online published: 23 Feb 2025
DOIs
Publication status: Published - Mar 2025

Funding

This research was supported by the advanced computing resources provided by the Supercomputing Center of Hangzhou City University, Ningbo Natural Science Foundation (No. 2023J292), and the General Research Fund of the Research Grants Council of Hong Kong and the research funds of the City University of Hong Kong (6000796, 9229109, 9229098, 9220103, 9229029).

Research Keywords

  • Information Gain
  • Semi-Supervised Learning
  • Vulnerability Detection
