SemiRALD : A semi-supervised hybrid language model for robust Anomalous Log Detection

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

Detail(s)

Original language: English
Article number: 107743
Number of pages: 19
Journal / Publication: Information and Software Technology
Volume: 183
Online published: 11 Apr 2025
Publication status: Online published - 11 Apr 2025

Abstract

Context: Deep learning-based Anomalous Log Detection (DALD) tools are critical for software reliability, but current approaches face challenges, including information loss during log parsing, reliance on large labeled datasets, and fragility in low-resource scenarios.

Objective: To overcome these limitations, we propose SemiRALD, a robust semi-supervised ALD approach that leverages a Large Language Model (LLM) for log parsing, enhancing both flexibility and accuracy. It uses a hybrid language model to repeatedly fit the samples and generate pseudo-labels, thereby training DALD models with limited resources and enabling efficient anomaly detection.
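
The pseudo-labeling loop mentioned in the Objective can be sketched roughly as follows. This is a minimal illustration, not SemiRALD's implementation: a toy nearest-centroid classifier over one-dimensional scores stands in for the hybrid language model, and the names `self_train`, `threshold`, and `max_rounds`, the confidence measure, and the data are all assumptions made for this example.

```python
from statistics import mean


def train_centroids(samples, labels):
    """Fit a toy nearest-centroid classifier: one mean value per class."""
    return {c: mean(x for x, y in zip(samples, labels) if y == c)
            for c in set(labels)}


def predict_with_confidence(centroids, x):
    """Return (label, confidence) for one sample.

    Confidence is 1 minus the share of the total distance taken by the
    nearest centroid (an assumed, illustrative measure).
    """
    dists = {c: abs(x - m) for c, m in centroids.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    conf = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, conf


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Repeatedly fit on labeled data and absorb confident pseudo-labels."""
    samples = [x for x, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_centroids(samples, labels)
        remaining, added = [], 0
        for x in pool:
            label, conf = predict_with_confidence(model, x)
            if conf >= threshold:
                # Confident prediction: keep it as a pseudo-label.
                samples.append(x)
                labels.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:
            break  # nothing new was learned in this round
    return train_centroids(samples, labels), pool


# Hypothetical data: 0 = normal, 1 = anomalous.
labeled = [(0.1, 0), (0.9, 1)]
unlabeled = [0.05, 0.2, 0.45, 0.8, 0.95]
model, leftover = self_train(labeled, unlabeled)
```

In this toy run the ambiguous sample `0.45` never reaches the confidence threshold, so it stays in `leftover` while the other four samples are absorbed as pseudo-labels.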

Method: SemiRALD uses ChatGPT with in-context learning for automated log parsing, thereby improving log integrity during parsing. It then applies a semi-supervised learning framework and our proposed hybrid language model to remedy the performance degradation caused by low-resource restrictions in practice. The semi-supervised framework requires only a small amount of labeled data throughout the entire process, while the hybrid language model is built on RoBERTa and an attention-based BiLSTM.
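
To illustrate what in-context learning for log parsing can look like, the sketch below assembles a few-shot prompt from (log, template) demonstrations. It is an assumption-laden example rather than the prompt used by SemiRALD: the function name `build_parsing_prompt`, the instruction wording, the `<*>` placeholder convention, and the HDFS-style demonstration lines are all hypothetical, and no model API is called.

```python
def build_parsing_prompt(demonstrations, raw_log):
    """Assemble a few-shot prompt: instruction, demonstrations, query log."""
    lines = [
        "Extract the log template. Replace variable parts with <*>.",
        "",
    ]
    for log, template in demonstrations:
        # Each demonstration pairs a raw log line with its template.
        lines.append(f"Log: {log}")
        lines.append(f"Template: {template}")
        lines.append("")
    # The query log comes last; the model is expected to complete the template.
    lines.append(f"Log: {raw_log}")
    lines.append("Template:")
    return "\n".join(lines)


# Hypothetical demonstrations in the style of HDFS logs.
demos = [
    ("Received block blk_123 of size 67108864 from /10.0.0.1",
     "Received block <*> of size <*> from <*>"),
    ("Deleting block blk_456 file /data/blk_456",
     "Deleting block <*> file <*>"),
]
prompt = build_parsing_prompt(demos, "Received block blk_789 of size 1024 from /10.0.0.2")
```

The resulting string would be sent to the LLM as the user message; the returned completion is then treated as the parsed template.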

Results: Experiments on the HDFS and BGL datasets demonstrate that SemiRALD achieves average F1-score improvements of 7.3% and 8.2%, respectively, over seven benchmark models. On small-scale datasets (0.1% of the original size), SemiRALD outperforms competitors by 31.4% and 46.0% in F1-score, respectively. Its consistent performance across diverse datasets highlights its generalizability and robustness.
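
For readers unfamiliar with the metric, the F1-score reported above is the harmonic mean of precision and recall. The short sketch below computes it from confusion-matrix counts; the function name `f1_score` and the counts are made up for illustration and are not taken from the paper's experiments.

```python
def f1_score(tp, fp, fn):
    """Compute F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 anomalies caught, 20 false alarms, 20 missed.
example_f1 = f1_score(tp=80, fp=20, fn=20)
```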

Conclusion: SemiRALD is capable of handling anomaly detection tasks in both large-scale and low-resource datasets, delivering significant advancements in anomaly log detection and offering robust, adaptable solutions to address prevalent challenges in the field of software reliability engineering.
© 2025 Elsevier B.V.

Research Area(s)

  • Anomaly log detection, Software reliability, Log parsing, RoBERTa, Bi-LSTM, Semi-supervised learning