Abstract
The integration of deep learning techniques into bug report and log anomaly detection represents a foundational aspect of software and system maintenance. Typically, these analyses encompass two primary stages: text representation and representation learning. Text representation converts raw text data into machine-readable formats, such as embeddings in the feature space, while representation learning applies machine learning and deep learning techniques to learn effective features from these representations. Furthermore, clustering and resampling techniques are commonly used to enhance representation learning by facilitating the identification of meaningful data patterns and addressing issues such as class imbalance. In this thesis, we embark on a comprehensive exploration of four pivotal software engineering tasks, each tackling crucial aspects of bug report analysis and log anomaly detection: (1) security bug report detection, (2) bug report title generation, (3) log anomaly detection, and (4) data resampling in log anomaly detection. We propose three novel deep learning approaches (CASMS, AttSum, AdaLog) and conduct an empirical study to address ongoing research challenges.
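As a point of reference for the two-stage pipeline described above, the sketch below pairs a simple text representation (tf-idf) with a basic learner (logistic regression) using scikit-learn. It is a toy baseline with invented example data, not any of the approaches proposed in this thesis, which rely on far richer representations (word2vec, RoBERTa) and deep models.

```python
# Toy two-stage pipeline: tf-idf text representation + logistic regression.
# Example reports and labels are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = ["buffer overflow in the XML parser", "login button misaligned after resize"]
labels = [1, 0]  # 1 = security-related, 0 = non-security (toy labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reports, labels)
print(model.predict(["possible overflow when parsing crafted input"]))
```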
Specifically, bug reports (BRs) serve as indispensable records in software maintenance, documenting issues reported by users for resolution by software developers. Given the varying severity of bug reports, prioritizing the detection of Security Bug Reports (SBRs) is critical. Identifying and resolving the bugs exposed in SBRs can help mitigate potential damage to software products before disclosure. Moreover, the title is a compulsory component of a BR, making the automatic generation of high-quality bug report titles a focal point for researchers. This task aims to extract the core content of a BR, thereby improving its readability and usability. Logs play a pivotal role in maintaining software-intensive systems, capturing vital information essential for troubleshooting and performance monitoring. Consequently, the analysis of logs for anomaly detection emerges as a significant area of interest. Despite numerous studies dedicated to these tasks in recent years, persistent research challenges remain unresolved.
In the realm of security bug report detection, the inappropriate disclosure of SBRs poses significant risks to software systems. However, prevailing techniques face challenges arising from class imbalance, inadequate training data, and limited performance robustness. To address these issues, we introduce the CASMS approach. CASMS leverages weighted word embeddings derived from tf-idf and word2vec to transform bug reports into feature vectors. It then applies the Elbow method and the k-means clustering algorithm to automatically select diverse non-security bug reports (NSBRs), thereby addressing class imbalance. Finally, an Attention CNN–BLSTM model processes both SBRs and the selected NSBRs, efficiently extracting contextual and sequential information for representation learning.
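The sketch below illustrates, under stated assumptions, the kind of steps CASMS describes: tf-idf weighted word2vec embeddings of bug reports, followed by elbow-guided k-means clustering to select diverse NSBRs. The function names (`weighted_embeddings`, `select_diverse_nsbrs`), the crude elbow heuristic, and all hyperparameters are illustrative choices, not the thesis implementation.

```python
# Illustrative sketch (not the thesis implementation): tf-idf weighted word2vec
# embeddings, then elbow-guided k-means selection of diverse NSBRs.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def weighted_embeddings(docs, vector_size=100):
    """Embed each bug report as the tf-idf weighted mean of its word2vec vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    w2v = Word2Vec(tokenized, vector_size=vector_size, min_count=1, seed=42)
    tfidf = TfidfVectorizer(lowercase=True, token_pattern=r"\S+")
    matrix = tfidf.fit_transform(docs)
    vocab = tfidf.vocabulary_
    embeddings = np.zeros((len(docs), vector_size))
    for i, tokens in enumerate(tokenized):
        weights, vectors = [], []
        for tok in tokens:
            if tok in vocab and tok in w2v.wv:
                weights.append(matrix[i, vocab[tok]])
                vectors.append(w2v.wv[tok])
        if vectors:
            embeddings[i] = np.average(vectors, axis=0, weights=weights)
    return embeddings

def select_diverse_nsbrs(nsbr_vectors, n_select, k_range=range(2, 10)):
    """Pick k via a crude elbow on k-means inertia, then draw from every cluster."""
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(nsbr_vectors).inertia_
                for k in k_range]
    k_best = list(k_range)[int(np.argmax(-np.diff(inertias)))]  # largest inertia drop
    labels = KMeans(n_clusters=k_best, n_init=10, random_state=0).fit_predict(nsbr_vectors)
    per_cluster = max(1, n_select // k_best)
    selected = []
    for c in range(k_best):
        idx = np.where(labels == c)[0]
        selected.extend(idx[:per_cluster])  # take up to per_cluster reports per cluster
    return selected[:n_select]
```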
Towards bug report title generation, concise and informative titles play a crucial role in facilitating efficient bug resolution. However, bug reporters often struggle to write high-quality titles. To mitigate this, we propose AttSum, an innovative deep attention-based summarization model. AttSum adopts an encoder-decoder framework, using the robustly optimized BERT pretraining approach (RoBERTa) for deep text representation of bug reports, and incorporates a stacked transformer decoder to generate bug report title tokens. Both the RoBERTa encoder and the stacked transformer decoder rely on multi-head attention, enabling them to retain previous information and capture dependencies between input and output sequences regardless of distance. Consequently, AttSum excels at extracting and summarizing the global semantic information of bug report bodies, enhancing its overall performance. Additionally, we employ a copy mechanism to effectively address the rare-term problem.
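A minimal PyTorch sketch of an encoder-decoder of the kind AttSum describes: a RoBERTa encoder over the bug report body and a stacked transformer decoder that emits title tokens. It assumes the Hugging Face `roberta-base` checkpoint and a 6-layer decoder, and it omits the copy mechanism and the training loop; it is not the thesis's implementation.

```python
# Sketch of a RoBERTa-encoder + stacked-transformer-decoder title generator.
# Checkpoint, layer counts, and the example input are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizerFast

class TitleGenerator(nn.Module):
    def __init__(self, vocab_size, d_model=768, n_layers=6, n_heads=8):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, body_ids, body_mask, title_ids):
        # Encode the bug report body with RoBERTa (multi-head self-attention).
        memory = self.encoder(input_ids=body_ids, attention_mask=body_mask).last_hidden_state
        # Causal mask so each title token attends only to earlier title tokens.
        tgt_len = title_ids.size(1)
        causal = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
        hidden = self.decoder(self.embed(title_ids), memory, tgt_mask=causal)
        return self.out(hidden)  # token logits over the vocabulary

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = TitleGenerator(vocab_size=tokenizer.vocab_size)
body = tokenizer("NullPointerException when saving an empty project", return_tensors="pt")
title_ids = torch.tensor([[tokenizer.bos_token_id]])  # start decoding from <s>
logits = model(body["input_ids"], body["attention_mask"], title_ids)
```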
In the domain of log anomaly detection (LAD), a variety of machine learning and deep learning approaches have been proposed, categorized into supervised, semi-supervised, and unsupervised methods. While semi-supervised techniques show promise by requiring only a fraction of labeled data and demonstrating relative stability, current approaches often suffer from manual parameter tuning and high false positive rates. To address these challenges, we introduce AdaLog, an integrated semi-supervised approach. Specifically, AdaLog utilizes a pre-trained model for word embedding, employs a self-adaptive clustering method to accurately estimate label probabilities for unlabeled data across twelve designed scenarios, and integrates a transformer-based model for prediction. Furthermore, to mitigate class imbalance, AdaLog incorporates undersampling to enhance model performance.
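The following sketch conveys the basic idea of cluster-based label probability estimation for unlabeled log sequences, one building block AdaLog describes. It is a simplified illustration: plain k-means stands in for AdaLog's self-adaptive clustering, and the twelve scenarios, the pre-trained embedding model, and the transformer predictor are not shown.

```python
# Illustrative sketch (not AdaLog's exact procedure): estimate anomaly probabilities
# for unlabeled log sequences from the label composition of their cluster.
import numpy as np
from sklearn.cluster import KMeans

def estimate_label_probabilities(embeddings, labels, n_clusters=8):
    """labels: 1 = anomalous, 0 = normal, -1 = unlabeled."""
    embeddings = np.asarray(embeddings)
    labels = np.asarray(labels)
    assignments = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    probs = np.full(len(labels), np.nan)
    for c in range(n_clusters):
        members = assignments == c
        labeled = members & (labels != -1)
        if labeled.any():
            p_anomaly = labels[labeled].mean()  # fraction of labeled members that are anomalous
        else:
            p_anomaly = 0.5  # no labeled evidence in this cluster
        probs[members & (labels == -1)] = p_anomaly
        probs[labeled] = labels[labeled]  # keep known labels as hard probabilities
    return probs
```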
In the realm of LAD, the persistent class imbalance in publicly available data hinders the effectiveness of deep learning-based models, yet whether data resampling can address this imbalance remains uncertain. To bridge this gap, we conduct a comprehensive analysis of various data resampling techniques and their impact on existing deep learning-based log anomaly detection (DLLAD) approaches. Through empirical evaluations on multiple benchmarks, we elucidate the effectiveness of resampling techniques in alleviating class imbalance and improving the performance of DLLAD approaches. Our study offers a valuable roadmap for researchers tackling data imbalance in DLLAD: by applying the recommended resampling strategies, they can significantly improve the performance and effectiveness of DLLAD approaches.
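For concreteness, the sketch below shows the kind of resampling techniques such a study compares, using imbalanced-learn; the specific techniques, sampling ratios, and DLLAD models evaluated in the thesis may differ.

```python
# Minimal sketch of common resampling strategies; choices shown here are
# illustrative, not the exact configurations evaluated in the thesis.
from collections import Counter
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

def resample(X, y, strategy="smote"):
    samplers = {
        "random_over": RandomOverSampler(random_state=0),
        "smote": SMOTE(random_state=0),
        "random_under": RandomUnderSampler(random_state=0),
    }
    X_res, y_res = samplers[strategy].fit_resample(X, y)
    print("class counts before:", Counter(y), "after:", Counter(y_res))
    return X_res, y_res
```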
In summary, this thesis introduces novel approaches and empirical studies geared toward enhancing bug report analysis and log anomaly detection in software and system maintenance. Through our proposed approaches and empirical insights, we aim to make meaningful contributions to the continual advancement of robust and effective techniques within these domains.
| Date of Award | 30 Aug 2024 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Jacky Wai Keung (Supervisor) |
Keywords
- Text Representation Learning
- Security Bug Report Detection
- Title Generation
- Log Anomaly Detection
- Data Resampling