Code Representation Learning and Its Applications
代碼表示學習及其應用
Student thesis: Doctoral Thesis
Detail(s)
Award date | 29 Aug 2023 |
---|---|
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(bcc2007b-7f0e-4cb5-880d-6e96a49bcb8d).html |
---|---|
Abstract
Code representation learning is a key technology underpinning many downstream tasks in intelligent software engineering, such as software defect prediction, code summarization, code-comment synchronization, and code clone detection. The process can be divided into two stages: code representation and representation learning. Code can be represented as token-based sequences, Abstract Syntax Trees (ASTs), control flow-based graphs, or expert-formulated features, and the algorithms that learn from these representations are equally diverse, including machine learning, deep learning, and reinforcement learning.
In this thesis, we focus on three downstream software engineering tasks: (1) smart contract vulnerability auditing, (2) code summarization, and (3) code-comment synchronization. For each task, we propose a code representation learning approach that addresses its open research challenges. Smart contract vulnerability auditing aims to detect vulnerability-prone modules before smart contracts are executed on the blockchain, thereby ensuring the quality and security of Decentralized Applications (DApps). Code summarization automatically generates a Natural Language Description (NLD) of the functionality of a given code snippet, while code-comment synchronization automatically updates comments to reflect code changes; both help developers comprehend code functionality and maintain code repositories. Although numerous studies have addressed these tasks in recent years, many pervasive and longstanding research challenges remain unsolved.
For smart contract vulnerability auditing, current data-driven approaches typically represent smart contract source code as sequences produced under a single tokenization standard, so some semantic context is lost within the restricted sequence length. To address this limitation, we generate sequences from smart contracts under three tokenization standards (text only, structure only, and a combination of both). We then use an n-gram language model to capture the semantic context of each sequence type separately, and finally apply an Intersection or Union combination strategy to integrate the audit results from the multiple semantic contexts. As a result, the code representation for smart contracts is learned from multiple perspectives, and the accuracy of vulnerability diagnosis improves by a large margin.
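The auditing pipeline described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the three regex tokenizers, the add-one-smoothed bigram model, the rule "flag a contract when its mean cross-entropy exceeds a threshold", and all function names are assumptions made for illustration. Only the idea of three tokenization views combined by Intersection or Union comes from the abstract.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizers standing in for the three tokenization standards.
def tokenize_text(code):
    """Text-only view: identifiers, keywords, and numbers."""
    return re.findall(r"[A-Za-z_]\w*|\d+", code)

def tokenize_structure(code):
    """Structure-only view: punctuation and operators."""
    return re.findall(r"[{}()\[\];.,=<>!+\-*/]", code)

def tokenize_both(code):
    """Combined view: text and structure tokens in source order."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[{}()\[\];.,=<>!+\-*/]", code)

class BigramModel:
    """Bigram (n = 2) language model with add-one smoothing."""

    def __init__(self, sequences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = set()
        for seq in sequences:
            padded = ["<s>"] + list(seq)
            self.vocab.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def cross_entropy(self, seq):
        """Mean negative log2-probability per token; higher means more unusual."""
        padded = ["<s>"] + list(seq)
        pairs = list(zip(padded, padded[1:]))
        if not pairs:
            return 0.0
        size = len(self.vocab) + 1  # one extra slot for unseen tokens
        total = 0.0
        for prev, cur in pairs:
            prob = (self.bigram_counts[(prev, cur)] + 1) / (self.context_counts[prev] + size)
            total -= math.log2(prob)
        return total / len(pairs)

def audit(contracts, tokenizers, models, threshold):
    """Per view, return the set of contract names whose sequence looks unusual.

    Assumption: "unusual" means mean cross-entropy above `threshold`.
    """
    return [
        {name for name, code in contracts.items()
         if model.cross_entropy(tokenizer(code)) > threshold}
        for tokenizer, model in zip(tokenizers, models)
    ]

def combine(flag_sets, strategy):
    """Integrate per-view results by Intersection or Union."""
    if strategy == "intersection":
        return set.intersection(*flag_sets)
    if strategy == "union":
        return set.union(*flag_sets)
    raise ValueError(f"unknown strategy: {strategy}")
```

Under these assumptions, Intersection reports a contract only when every view flags it, which favors precision, while Union reports a contract when any view flags it, which favors recall.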
In code summarization, effectively representing source code and capturing its long-range dependencies remain open problems. We therefore propose MMTrans, a Multi-Modal Transformer-based code summarization approach. MMTrans learns the representation of source code from two heterogeneous modalities of the AST: Structure-Based Traversal (SBT) sequences and graphs. The SBT sequence provides the global semantic information of the AST, while graph convolution focuses on local details. MMTrans uses two encoders to extract global and local semantic information from the two modalities, respectively, and a joint decoder to generate code comments. Both encoders and the decoder employ the multi-head attention structure of the Transformer to better capture long-range dependencies between code tokens. As a result, MMTrans learns an effective code representation and performs well on the code summarization task.
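To make the two AST modalities concrete, the following sketch derives both an SBT sequence and a parent-child edge list from the same tree. It uses Python's built-in `ast` module purely as a stand-in parser, which is an assumption: the abstract does not state which language or parser the thesis works with. The bracketed SBT layout follows the commonly used definition, in which each subtree is wrapped as `( label ... ) label`. The function names `sbt` and `ast_graph` are hypothetical, and the encoders and decoder are not modeled here.

```python
import ast

def sbt(node):
    """Structure-Based Traversal: wrap each subtree as ( label ... ) label."""
    label = type(node).__name__
    tokens = ["(", label]
    for child in ast.iter_child_nodes(node):
        tokens.extend(sbt(child))
    tokens.extend([")", label])
    return tokens

def ast_graph(root):
    """Graph modality: node labels plus (parent, child) index pairs."""
    labels, edges = [], []

    def visit(node, parent_index):
        index = len(labels)
        labels.append(type(node).__name__)
        if parent_index is not None:
            edges.append((parent_index, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(root, None)
    return labels, edges

tree = ast.parse("def add(a, b):\n    return a + b\n")
sequence = sbt(tree)              # input for a sequence encoder
labels, edges = ast_graph(tree)   # input for a graph encoder
print(" ".join(sequence))
print(edges)
```

The sequence keeps the whole tree in one linear string, which matches the "global" role described above, while the edge list exposes each node's immediate neighborhood, which is what a graph convolution aggregates.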
For the Code-Comment Synchronization (CCS) task, solutions of a single type (either deep learning-based or heuristic-based) cannot handle all kinds of samples that arise in complex, realistic situations. An intuitive remedy is to route each Code-Comment Inconsistent (CCI) sample to the CCS model that suits it, guided by an accurate classification. To this end, we propose a composite approach named CBS (Classifying Before Synchronizing), which improves code-comment synchronization performance by combining the advantages of a deep learning-based model (CUP) and a heuristic-based model (HebCUP) according to the inferred category of each CCI sample. Specifically, we first define two categories of CCI samples, heuristic-prone and non-heuristic-prone, and propose five features that represent code changes and code-comment connections to support category prediction. Samples whose comments can be correctly synchronized by HebCUP are heuristic-prone; all others are non-heuristic-prone. CBS then employs our proposed Multi-Subsets Ensemble Learning (MSEL) classification algorithm to alleviate the class imbalance problem and to build the category prediction model from the manually crafted features. Next, CBS uses the trained MSEL model to predict the category of a new sample. If the predicted category is heuristic-prone, CBS applies HebCUP to synchronize the sample's comment; otherwise, it applies CUP. The key strength of CBS lies in its ability to allocate CCI samples accurately, which stems from representing code-comment changes with manually crafted features and learning from this representation.
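The classify-then-route flow of CBS can be sketched as follows. The abstract does not specify how MSEL is built, so the ensemble here is an assumption: the majority class is split into disjoint subsets, each subset is paired with the full minority class to train one base learner, and predictions are combined by majority vote. The nearest-centroid base learner, the feature vectors, and the `heuristic_model` and `learned_model` callables (standing in for HebCUP and CUP) are likewise hypothetical.

```python
import random

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class NearestCentroid:
    """Tiny base learner: predicts the class whose mean vector is closest."""

    def fit(self, X, y):
        self.centroids = {
            label: centroid([x for x, t in zip(X, y) if t == label])
            for label in set(y)
        }
        return self

    def predict(self, x):
        return min(self.centroids,
                   key=lambda label: squared_distance(x, self.centroids[label]))

def train_subset_ensemble(X, y, minority, n_subsets, seed=0):
    """Train one base learner per balanced subset (assumed reading of MSEL).

    Use an odd n_subsets no larger than the majority class size, so that
    every subset is non-empty and two-class votes cannot tie.
    """
    rng = random.Random(seed)
    minor = [x for x, t in zip(X, y) if t == minority]
    major = [(x, t) for x, t in zip(X, y) if t != minority]
    rng.shuffle(major)
    models = []
    for i in range(n_subsets):
        part = major[i::n_subsets]  # disjoint slice of the majority class
        subset_X = minor + [x for x, _ in part]
        subset_y = [minority] * len(minor) + [t for _, t in part]
        models.append(NearestCentroid().fit(subset_X, subset_y))
    return models

def predict_category(models, features):
    """Majority vote over the base learners."""
    votes = [model.predict(features) for model in models]
    return max(set(votes), key=votes.count)

def synchronize(sample, features, models, heuristic_model, learned_model):
    """Route the sample to the heuristic or the learned synchronizer."""
    if predict_category(models, features) == "heuristic-prone":
        return heuristic_model(sample)
    return learned_model(sample)
```

Training each base learner on a balanced subset keeps the minority class from being outvoted within any single learner, while the ensemble as a whole still sees every majority-class sample.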
We have conducted extensive experiments showing that each proposed approach outperforms the corresponding state-of-the-art baselines in its research area by a significant margin. These results highlight the practicality of each approach for software development and maintenance. We hope that the promising outcomes of this thesis will attract interest and motivate further research in these domains, ultimately leading to greater automation of software development and maintenance and a substantial reduction in manual effort.
- Code Representation Learning, Vulnerability Detection, Code Summarization, Code-Comment Synchronization, Natural Language Processing