Abstract
Although open-access data are increasingly common and useful for epidemiological research, curating such datasets is resource-intensive and time-consuming. Regularly disclosed case reports were a major source of COVID-19 data, but they were often written in natural language with an unstructured format. Here, we propose a computational framework that automatically extracts epidemiological information from open-access COVID-19 case reports. We develop this framework by coupling a language model built on deep neural networks with training samples compiled using an optimized data annotation strategy. When applied to COVID-19 case reports collected from mainland China, our framework outperforms all other state-of-the-art deep learning models. The information extracted by our approach is highly consistent with that obtained from gold-standard manual coding, with a matching rate of 80%. To disseminate our algorithm, we provide an open-access online platform that can estimate key epidemiological statistics in real time, with much less data curation effort.
| Field | Value |
|---|---|
| Original language | English |
| Article number | 105079 |
| Journal | iScience |
| Volume | 25 |
| Issue number | 10 |
| Online published | 5 Sept 2022 |
| Publication status | Published - 21 Oct 2022 |
Research Keywords
- Artificial intelligence
- Health sciences
- Machine learning
- Virology
Publisher's Copyright Statement
- This full text is made available under CC-BY-NC-ND 4.0. https://creativecommons.org/licenses/by-nc-nd/4.0/