Epidemiologic information discovery from open-access COVID-19 case reports via pretrained language model

Zhizheng Wang, Xiao Fan Liu, Zhanwei Du, Lin Wang*, Ye Wu, Petter Holme, Michael Lachmann, Hongfei Lin, Zoie S.Y. Wong*, Xiao-Ke Xu*, Yuanyuan Sun*

*Corresponding author for this work

Research output: Journal Publications and Reviews; RGC 21 - Publication in refereed journal; peer-review

22 Citations (Scopus)
47 Downloads (CityUHK Scholars)

Abstract

Although open-access data are increasingly common and useful to epidemiological research, the curation of such datasets is resource-intensive and time-consuming. Regularly disclosed case reports are a major source of COVID-19 data, yet they were often written in natural language with an unstructured format. Here, we propose a computational framework that can automatically extract epidemiological information from open-access COVID-19 case reports. We develop this framework by coupling a language model developed using deep neural networks with training samples compiled using an optimized data annotation strategy. When applied to the COVID-19 case reports collected from mainland China, our framework outperforms all other state-of-the-art deep learning models. The information extracted by our approach is highly consistent with that obtained from the gold-standard manual coding, with a matching rate of 80%. To disseminate our algorithm, we provide an open-access online platform that is able to estimate key epidemiological statistics in real time, with much less effort for data curation.
Original language: English
Article number: 105079
Journal: iScience
Volume: 25
Issue number: 10
Online published: 5 Sept 2022
Publication status: Published - 21 Oct 2022

Research Keywords

  • Artificial intelligence
  • Health sciences
  • Machine learning
  • Virology

Publisher's Copyright Statement

  • This full text is made available under CC-BY-NC-ND 4.0. https://creativecommons.org/licenses/by-nc-nd/4.0/
