Fight Fire with Fire: How Much Can We Trust ChatGPT on Source Code-Related Tasks?

Research output: Journal Publications and Reviews / RGC 21 - Publication in refereed journal / peer-review

Author(s)

  • Xiao Yu
  • Lei Liu
  • Xing Hu
  • Jin Liu
  • Xin Xia

Detail(s)

Original language: English
Pages (from-to): 3435-3453
Journal / Publication: IEEE Transactions on Software Engineering
Volume: 50
Issue number: 12
Online published: 5 Nov 2024
Publication status: Published - Dec 2024

Abstract

With the increasing utilization of large language models such as ChatGPT during software development, it has become crucial to verify the quality of the code content they generate. Recent studies proposed utilizing ChatGPT as both a developer and tester for multi-agent collaborative software development. The multi-agent collaboration empowers ChatGPT to produce test reports for its generated code, enabling it to self-verify the code content and fix bugs based on these reports. However, these studies did not assess the effectiveness of the generated test reports in validating the code. Therefore, we conduct a comprehensive empirical investigation to evaluate ChatGPT's self-verification capability in code generation, code completion, and program repair. We request ChatGPT to (1) generate correct code and then self-verify its correctness; (2) complete code without vulnerabilities and then self-verify for the presence of vulnerabilities; and (3) repair buggy code and then self-verify whether the bugs are resolved. Our findings on two code generation datasets, one code completion dataset, and two program repair datasets reveal the following observations: (1) ChatGPT often erroneously predicts its generated incorrect code as correct, its vulnerable completed code as non-vulnerable, and its failed program repairs as successful during its self-verification. (2) Self-contradictory hallucinations arise in ChatGPT's behavior: (a) ChatGPT initially generates code that it believes to be correct but later predicts it to be incorrect; (b) ChatGPT initially generates code completions that it deems secure but later predicts them to be vulnerable; (c) ChatGPT initially outputs code that it considers successfully repaired but later predicts it to be buggy during its self-verification. (3) The self-verification capability of ChatGPT can be enhanced by asking the guiding question, which queries whether ChatGPT agrees with assertions about incorrectly generated or repaired code and vulnerabilities in completed code. (4) Using test reports generated by ChatGPT can identify more vulnerabilities in completed code, but the explanations for incorrectly generated code and failed repairs are mostly inaccurate in the test reports. Based on these findings, we provide implications for further research or development using ChatGPT. © 2024 IEEE.
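As a rough illustration of the generate-then-self-verify protocol the abstract describes (generate code, ask the model to judge its own output, and optionally pose a guiding question), the following minimal Python sketch uses a hypothetical ask_chatgpt() helper; the helper and prompt wording are assumptions for illustration only, not the authors' actual experimental setup.

```python
# Minimal sketch of the generate-then-self-verify loop described in the abstract.
# ask_chatgpt() is a hypothetical placeholder standing in for any chat-model API call.

def ask_chatgpt(prompt: str) -> str:
    raise NotImplementedError("Placeholder: call your chat-model API here.")


def generate_and_self_verify(task_description: str) -> dict:
    # Step 1: ask the model to generate code for the task.
    code = ask_chatgpt(f"Write a correct solution for this task:\n{task_description}")

    # Step 2: direct self-verification -- ask whether the generated code is correct.
    direct_verdict = ask_chatgpt(
        f"Task:\n{task_description}\n\nCode:\n{code}\n\n"
        "Is this code a correct solution? Answer 'correct' or 'incorrect' and explain."
    )

    # Step 3: guiding-question variant -- ask whether the model agrees with an
    # assertion that the code is incorrect (the abstract reports this form of
    # questioning improves self-verification).
    guided_verdict = ask_chatgpt(
        f"Task:\n{task_description}\n\nCode:\n{code}\n\n"
        "I believe this code is incorrect. Do you agree? Answer 'yes' or 'no' and explain."
    )

    return {"code": code, "direct": direct_verdict, "guided": guided_verdict}
```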

Research Area(s)

  • ChatGPT, code completion, code generation, empirical study, program repair, self-verification

Citation Format(s)

Fight Fire with Fire: How Much Can We Trust ChatGPT on Source Code-Related Tasks? / Yu, Xiao; Liu, Lei; Hu, Xing et al.
In: IEEE Transactions on Software Engineering, Vol. 50, No. 12, 12.2024, p. 3435-3453.