Repo4QA: Answering Coding Questions via Dense Retrieval on GitHub Repositories

Minyu Chen, Guoqiang Li*, Chen Ma, Jingyang Li, Hongfei Fu

*Corresponding author for this work

Research output: Chapters, Conference Papers, Creative and Literary Works; RGC 32 - Refereed conference paper (with host publication); peer-review

3 Citations (Scopus)
132 Downloads (CityUHK Scholars)

Abstract

Open-source platforms such as GitHub and Stack Overflow both play significant roles in current software ecosystems. Developers often raise programming questions in coding forums such as Stack Overflow and are then directed to actual solutions in GitHub repositories, a process that is crucial but time-consuming. In this paper, we aim to accelerate this process. We find that traditional information-retrieval-based methods fail to handle the long and complex questions in coding forums, and thus cannot find suitable coding repositories. To effectively and efficiently bridge the semantic gap between repositories and real-world coding questions, we introduce a specialized dataset named Repo4QA, which includes over 12,000 question-repository pairs constructed from Stack Overflow and GitHub. Furthermore, we propose QuRep, a CodeBERT-based model that jointly learns the representation of both questions and repositories. Experimental results demonstrate that our model simultaneously captures the semantic features in both questions and repositories through supervised contrastive loss and hard negative sampling. We report that our approach outperforms existing state-of-the-art methods by 3%-8% on MRR and 5%-8% on P@1. © 2022 Proceedings - International Conference on Computational Linguistics, COLING.
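
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how MRR (mean reciprocal rank) and P@1 (precision at rank 1) are commonly computed for a retrieval task. It is not taken from the paper: the function names, the toy repository identifiers, and the assumption that each question has exactly one correct repository are illustrative only.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: str) -> float:
    """Return 1/rank of the relevant repository, or 0.0 if it was not retrieved."""
    for position, repo in enumerate(ranked, start=1):
        if repo == relevant:
            return 1.0 / position
    return 0.0


def mrr_and_p_at_1(
    rankings: List[Sequence[str]], gold: List[str]
) -> Tuple[float, float]:
    """Compute MRR and P@1 over a set of questions.

    Assumption (hypothetical setup): one gold repository per question.
    """
    if len(rankings) != len(gold):
        raise ValueError("rankings and gold must have the same length")
    if not gold:
        raise ValueError("at least one question is required")

    # MRR: average of the reciprocal rank of the gold repository.
    mrr = sum(reciprocal_rank(r, g) for r, g in zip(rankings, gold)) / len(gold)

    # P@1: fraction of questions whose top-ranked repository is the gold one.
    hits = sum(1 for r, g in zip(rankings, gold) if len(r) > 0 and r[0] == g)
    return mrr, hits / len(gold)


if __name__ == "__main__":
    # Toy example with made-up repository names.
    rankings = [
        ["repo/a", "repo/b", "repo/c"],  # gold at rank 1
        ["repo/b", "repo/a", "repo/c"],  # gold at rank 2
        ["repo/c", "repo/b", "repo/a"],  # gold at rank 3
    ]
    gold = ["repo/a", "repo/a", "repo/a"]
    mrr, p_at_1 = mrr_and_p_at_1(rankings, gold)
    print(f"MRR={mrr:.3f} P@1={p_at_1:.3f}")
```

In this toy run the gold repository appears at ranks 1, 2 and 3, so the reciprocal ranks are 1, 1/2 and 1/3, and only the first question counts towards P@1.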
Original language: English
Title of host publication: Proceedings of the 29th International Conference on Computational Linguistics
Pages: 1580-1592
Publication status: Published - Oct 2022
Event: 29th International Conference on Computational Linguistics (COLING 2022) - https://coling2022.org/, Gyeongju, Korea, Republic of
Duration: 12 Oct 2022 to 17 Oct 2022

Publication series

Name: Proceedings - International Conference on Computational Linguistics, COLING
Number: 1
Volume: 29
ISSN (Print): 2951-2093

Conference

Conference: 29th International Conference on Computational Linguistics (COLING 2022)
Abbreviated title: COLING’2022
Place: Korea, Republic of
City: Gyeongju
Period: 12/10/22 to 17/10/22

Publisher's Copyright Statement

  • This full text is made available under CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/
