Epidemiologic information discovery from open-access COVID-19 case reports via pretrained language model

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review

16 Scopus Citations

Author(s)

  • Zhizheng Wang
  • Xiao Fan Liu
  • Zhanwei Du
  • Lin Wang
  • Ye Wu
  • Petter Holme
  • Michael Lachmann
  • Hongfei Lin
  • Zoie S.Y. Wong
  • Xiao-Ke Xu
  • Yuanyuan Sun

Detail(s)

Original language: English
Article number: 105079
Journal / Publication: iScience
Volume: 25
Issue number: 10
Online published: 5 Sept 2022
Publication status: Published - 21 Oct 2022

Abstract

Although open-access data are increasingly common and useful to epidemiological research, the curation of such datasets is resource-intensive and time-consuming. Regularly disclosed case reports are a major source of COVID-19 data, but they were often written in natural language with an unstructured format. Here, we propose a computational framework that can automatically extract epidemiological information from open-access COVID-19 case reports. We develop this framework by coupling a language model developed using deep neural networks with training samples compiled using an optimized data annotation strategy. When applied to the COVID-19 case reports collected from mainland China, our framework outperforms all other state-of-the-art deep learning models. The information extracted by our approach is highly consistent with that obtained from gold-standard manual coding, with a matching rate of 80%. To disseminate our algorithm, we provide an open-access online platform that can estimate key epidemiological statistics in real time, with much less effort for data curation.

Research Area(s)

  • Artificial intelligence, Health sciences, Machine learning, Virology

Citation Format(s)

Epidemiologic information discovery from open-access COVID-19 case reports via pretrained language model. / Wang, Zhizheng; Liu, Xiao Fan; Du, Zhanwei et al.
In: iScience, Vol. 25, No. 10, 105079, 21.10.2022.