Identification of critical SARS-CoV-2 amino acids associated with COVID-19 hospitalization rate using machine learning and statistical modeling: An observational study in the United States

Research output: Journal Publications and Reviews › RGC 21 - Publication in refereed journal › peer-review




Original language: English
Article number: 105480
Journal / Publication: Infection, Genetics and Evolution
Online published: 10 Jul 2023
Publication status: Published - Sept 2023



Background: The COVID-19 pandemic has put many medical systems on the verge of collapse in the last two years. Virus mutation was one of the important factors affecting COVID-19 infection severity and hospitalizations. Although over ten thousand SARS-CoV-2 mutations have been reported since the beginning of the COVID-19 pandemic, only a small percentage of mutations are likely to affect the virus phenotype and change its severity. Identifying which amino acids have the greatest impact on the COVID-19 hospitalization rate is an important research question.
Methods: This observational study used the COVID-19 case hospitalization ratio (CHR) to represent the virus severity related to hospitalization. The database is based on daily state-level epidemiological and genomic sequence data in the United States from the Alpha wave to the first Omicron wave. The critical amino acids that most affected the CHR were determined using four types of models: extreme gradient boosting decision trees (XGBoost), artificial neural networks (ANNs), logistic regression, and Lasso regression.
Results: The XGBoost, ANN, logistic regression, and Lasso regression models all fit the data well (the mean squared error of every state-level model does not exceed 0.0008 on the testing dataset). Based on the importance ranking of all covariates, the critical amino acids most affecting the CHR were identified, including T19, L24, P25, P26, A27, A67, H69, V70, T95, G142, V143, Y145, E156, F157, N211, L212, V213, R214, D215, G339, R346, S373, L452, S477, T478, E484, N501, A570, P681, and T716.
Conclusion: This study identified the critical amino acids that are most likely to affect the hospitalization rate, allowing public health workers to monitor these high-risk amino acids and raise an alarm immediately when more severe mutations occur. Furthermore, the methodology and results may be extended to other regions. © 2023 The Authors.
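
Illustrative sketch (not from the paper): the idea described in Methods, relating per-site mutation frequencies to the CHR and ranking the sites, can be mocked up in a few lines of Python. This is only a toy under stated assumptions. The abstract does not give the CHR formula, so it is assumed here to be new hospital admissions divided by new reported cases over the same period. The ranking uses absolute Pearson correlation as a simple stand-in for the model-based importance scores (XGBoost, ANN, logistic regression, Lasso) that the study actually used. All function names are hypothetical and all numbers are made up.

```python
from math import sqrt


def case_hospitalization_ratio(hospitalizations, cases):
    """Assumed CHR per period: hospital admissions / reported cases."""
    ratios = []
    for h, c in zip(hospitalizations, cases):
        if c <= 0:
            raise ValueError("each period needs a positive case count")
        ratios.append(h / c)
    return ratios


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_covariates(site_frequencies, chr_values):
    """Order sites by |correlation| between mutation frequency and CHR.

    Stand-in for model-based importance; not the study's method.
    """
    scores = {
        site: abs(pearson(freqs, chr_values))
        for site, freqs in site_frequencies.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data for one hypothetical state over four periods (made-up values).
cases = [1000, 1200, 1500, 2000]
hospitalizations = [50, 66, 90, 130]
chr_values = case_hospitalization_ratio(hospitalizations, cases)

# Made-up share of sequenced genomes carrying a mutation at each site.
site_frequencies = {
    "L452": [0.1, 0.3, 0.6, 0.9],
    "N501": [0.5, 0.5, 0.5, 0.5],
    "P681": [0.9, 0.2, 0.7, 0.4],
}

ranking = rank_covariates(site_frequencies, chr_values)
print(ranking)
```

In this toy data the site whose frequency rises together with the CHR ranks first and the constant-frequency site ranks last. A real analysis would replace the correlation score with importance measures from fitted models and evaluate them on held-out data, as the abstract describes.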

Research Area(s)

  • Case hospitalization ratio, COVID-19, SARS-CoV-2 amino acid mutation
