Lexical Mismatch Analysis for Bug Localization along with Extracting Semantic and Structural Features
錯誤定位的詞法不匹配分析以及語義和結構特徵的提取
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors | |
Award date | 23 Aug 2019 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(c4c4a37b-94b9-45ca-a20c-ad32805a81f8).html |
---|---|
Abstract
Automatically localizing buggy files can speed up bug fixing and so improve the efficiency and productivity of software quality assurance teams. The difficulty of locating bugs in large-scale software systems has led to the development of bug localization techniques. The most prevalent approaches, based on information retrieval, machine learning and deep learning, achieve encouraging results. However, several challenges remain in localizing bugs from bug reports. First, bug reports are written in natural language, while source files are written in code tokens. This lexical mismatch between bug reports and source code degrades the performance of existing information retrieval and machine learning-based approaches. Second, both bug reports and source files are written by humans and carry useful semantic information, which existing bug localization approaches usually underutilize. Third, compared with natural language, programs contain more stringent structural information. Localizing buggy files for bug reports using both their semantic and structural information is therefore a crucial task that would improve the accuracy of bug localization techniques.
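The lexical mismatch described above can be illustrated with a small sketch. It measures the Jaccard overlap between the words of a bug report and the identifiers of a source file, first with identifiers kept whole and then with camelCase identifiers split into parts. The bug report, the code fragment, and all function names are hypothetical and chosen only for illustration; this is not the analysis performed in the thesis.

```python
import re

def report_tokens(text):
    """Lowercased alphabetic words of a natural-language bug report."""
    return set(re.findall(r"[a-z]+", text.lower()))

def code_tokens(source, split_identifiers):
    """Identifiers of a code fragment, optionally split on camelCase."""
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    if not split_identifiers:
        return {name.lower() for name in identifiers}
    parts = set()
    for name in identifiers:
        # Acronym runs ("URL") or capitalized/lowercase words ("File", "save").
        for piece in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", name):
            parts.add(piece.lower())
    return parts

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical bug report and Java-like code fragment.
report = "Crash when saving file with empty name"
source = "void saveFile(String fileName) { if (fileName.isEmpty()) crashHandler.report(); }"

words = report_tokens(report)
whole = code_tokens(source, split_identifiers=False)
split = code_tokens(source, split_identifiers=True)

overlap_whole = jaccard(words, whole)
overlap_split = jaccard(words, split)
shared = words & split
```

With whole identifiers, the report and the code share no token at all. Splitting recovers a few shared words, but "saving" still does not match "save", which is the kind of residual gap that purely lexical matching cannot close.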
In this thesis, we introduce several models to address these problems. First, to bridge the lexical gap and localize buggy files more effectively by leveraging semantic information extracted from bug reports and source code, we present BugTranslator, a novel deep learning-based machine translation technique composed of an attention-based recurrent neural network (RNN) encoder-decoder with long short-term memory cells. One RNN encodes bug reports into several context vectors, which another RNN decodes into code tokens of buggy files. The technique learns and exploits the relevance between the semantic information extracted from bug reports and that extracted from source files. We then discuss an enhanced version that uses a character-level convolutional neural network (CNN) to improve the accuracy of bug localization by representing bug reports at the character level and analyzing them with a language model. This model is composed of two main parts: a character-level CNN and an RNN language model. Both bug reports and source files are represented at the character level and fed into a CNN, whose output is passed to an RNN encoder-decoder architecture. BugTranslator and its enhanced version treat bug reports and source code as different symbolic classes and then extract deep semantic similarity and relevance between bug reports and the corresponding buggy files to bridge the lexical gap at its source, thereby further improving the accuracy of bug localization.
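As a rough illustration of how an attention-based encoder-decoder forms a context vector, the sketch below runs a single attention step in plain Python. It assumes dot-product scoring, which the abstract does not specify, and uses tiny hand-written vectors in place of LSTM hidden states. The function and variable names are hypothetical and are not taken from BugTranslator.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(encoder_states, decoder_state):
    """One attention step (assumed dot-product scoring).

    encoder_states: one vector per bug-report token.
    decoder_state:  current decoder hidden state, same dimension.
    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [
        sum(h_k * s_k for h_k, s_k in zip(h, decoder_state))
        for h in encoder_states
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Toy stand-ins for encoder hidden states of a three-token bug report.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [1.0, 0.0]

weights, context = attention_context(encoder_states, decoder_state)
```

The first and third encoder states align equally well with the decoder state, so they receive equal and larger weights than the second. In a full model the context vector would be combined with the decoder state to predict the next code token.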
Second, to preserve semantic information, we adapt word embedding techniques to transform the words in bug reports and source files into word vectors. To parse semantics from these word vectors, we enhance the CNN so that, in addition to the semantic information a conventional CNN can extract from bug reports and source files, it makes use of bug-fixing experience (bug-fixing recency and frequency). On this basis, we propose DeepLocator and DeepLoc, two deep learning-based models that improve the accuracy of bug localization by making full use of semantic information. DeepLoc automatically connects bug reports to the corresponding buggy files and, based on a deep understanding of the semantics in bug reports and source code, outperforms four other existing approaches.
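The abstract does not define bug-fixing recency and frequency. The sketch below assumes one definition that is common in the bug localization literature: frequency is the number of fixes to a file before the bug report was filed, and recency is the inverse of the number of months since the most recent such fix, plus one. These formulas and all names are assumptions for illustration, not necessarily those used by DeepLocator or DeepLoc.

```python
from datetime import date

def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later`."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def fixing_features(fix_dates, report_date):
    """Return (recency, frequency) for one source file.

    Assumed definitions:
      frequency = number of fixes strictly before the report date
      recency   = 1 / (months since the latest such fix + 1)
    A file that was never fixed gets (0.0, 0).
    """
    past = [d for d in fix_dates if d < report_date]
    frequency = len(past)
    if not past:
        return 0.0, 0
    recency = 1.0 / (months_between(max(past), report_date) + 1)
    return recency, frequency

# Hypothetical fix history of one file; the last fix postdates the report
# and must be ignored to avoid leaking future information.
history = [date(2018, 1, 10), date(2018, 11, 5), date(2019, 3, 1)]
recency, frequency = fixing_features(history, date(2019, 1, 20))
```

Features like these are typically appended to the learned semantic features before the final ranking layer, so that recently and frequently fixed files rank higher.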
Third, to empirically evaluate and demonstrate how both semantic and structural information in bug reports and source files improve the accuracy of bug localization, we propose CNN_Forest, which combines a CNN with an ensemble of random forests, two components that perform well at semantic parsing and structural information extraction. CNN_Forest uses a CNN with multiple filters and an ensemble of random forests with multi-grained scanning to extract semantic and structural features from the word vectors derived from bug reports and source files. A subsequent cascade forest (a cascade of ensembles of random forests) then extracts deeper features and captures the correlations between bug reports and source files. CNN_Forest is thus able to characterize these correlations, and we show empirically that the semantic and structural information in bug reports and source files is crucial to improving bug localization.
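The windowing step of multi-grained scanning can be sketched as follows. This shows only how windows of several sizes are slid over a feature vector to produce sub-instances. In the full technique each window would be passed to random forests, whose class-probability outputs are concatenated into new features, and that part is omitted here. The window sizes, the stride, and the function names are hypothetical.

```python
def sliding_windows(features, window_size, stride=1):
    """All contiguous windows of `window_size` over a 1-D feature list."""
    last_start = len(features) - window_size
    return [
        features[start:start + window_size]
        for start in range(0, last_start + 1, stride)
    ]

def multi_grained_scan(features, window_sizes):
    """Map each window size (grain) to the windows it produces."""
    return {size: sliding_windows(features, size) for size in window_sizes}

# Toy stand-in for a feature vector derived from word vectors.
vector = [0.1, 0.4, 0.2, 0.9, 0.5, 0.3]
scans = multi_grained_scan(vector, window_sizes=(2, 3))
```

Using several window sizes lets the downstream forests see both short-range and longer-range patterns in the same input, which is the motivation for scanning at more than one grain.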
Our experiments indicate that the proposed models can increase the accuracy of bug localization. They also demonstrate empirically the significance of including both semantic and structural information in bug localization and of bridging the lexical gap between bug reports and source files. We hope that these promising results will encourage more in-depth research into these problems to further improve the accuracy of bug localization.