Characterizing Quasispecies of Known and Novel Viruses from Metagenomic Data

Project: Research


Pathogenic RNA viruses, such as HIV and HCV, still claim millions of lives each year despite intensive research. It is estimated that billions of viral particles of fast-replicating viruses can be generated each day in chronically infected patients. Because RNA viruses mutate at a high rate, these viral particles can contain an unknown number of different strains. If some of the strains are resistant to antiviral drugs, the treatment will fail. Therefore, inferring all the different strains within a viral population can provide indispensable knowledge for preventing and treating viral diseases.

In contrast to methods that sequence only targeted viruses, metagenomic sequencing is able to sequence all viruses in a sample, including novel ones. The focus of this research plan is to develop much-needed approaches for deriving all strains in viral populations of both known and novel viruses using metagenomic data. To the best of our knowledge, no such high-resolution metagenomic analysis tools optimized for RNA viruses currently exist.

Toward this goal, we must tackle the following challenges. First, novel viruses or divergent strains of a known virus family may share only low conservation with reference sequences, so widely used reference-based virus detection tools miss a large number of reads from novel viruses or strains. Second, despite promising results on gene-wise or local strain recovery, genome-scale reconstruction of viral strains is still difficult because of the high sequence identity between strains within the population.

Based on our 10+ years of experience in sequence analysis and our deep understanding of viral population properties, gained through our preliminary work on viral strain assembly, we propose three objectives to address these challenges.

1) Build the smallest possible marker gene set for RNA viruses in order to maximize the sensitivity of species-level virus composition analysis. We propose a novel formulation that models marker gene derivation as the vertex cover problem in a gene similarity graph.
2) Develop highly scalable read classification methods to identify reads from novel viruses or strains.
3) Infer the number of strains, their relative abundances, and their genomes inside each viral population by developing a novel EM algorithm that integrates the strains’ global similarity and their differing abundances.

The tools are expected to reveal new strains and viruses that could not be identified by existing tools. As novel viruses usually emerge in densely populated regions with high biodiversity, such as Hong Kong, our research outcome is needed to better prepare us for future viral outbreaks.
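
To make the vertex cover formulation in objective 1 concrete, the sketch below runs a standard greedy heuristic on a toy gene similarity graph. It is only an illustration under stated assumptions: the project description does not say how the graph is built or which algorithm is used, so the edge definition (pairs of genes assumed to be similar), the gene names, and the function name greedy_vertex_cover are all hypothetical. The greedy heuristic also does not guarantee a minimum cover; it simply yields a small set of genes that touches every edge.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def greedy_vertex_cover(edges: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return a set of genes that touches every edge (a vertex cover).

    Greedy max-degree heuristic: repeatedly pick the gene incident to the
    most still-uncovered edges. This is an approximation, not an exact
    minimum vertex cover, and not necessarily the project's method.
    """
    # Hypothetical input: each edge links two genes assumed to be similar.
    # Self-loops are ignored because they carry no pairwise information.
    remaining: Set[FrozenSet[str]] = {
        frozenset((u, v)) for u, v in edges if u != v
    }
    cover: Set[str] = set()

    while remaining:
        # Count how many uncovered edges each gene touches.
        degree: Dict[str, int] = {}
        for edge in remaining:
            for gene in edge:
                degree[gene] = degree.get(gene, 0) + 1

        # Highest degree wins; ties go to the alphabetically first gene
        # so that the result is deterministic.
        best = max(sorted(degree), key=lambda g: degree[g])
        cover.add(best)

        # Drop every edge that the chosen gene now covers.
        remaining = {edge for edge in remaining if best not in edge}

    return cover


# Toy similarity graph: a chain g1 - g2 - g3 - g4 (made-up gene names).
toy_edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4")]
toy_cover = greedy_vertex_cover(toy_edges)
print(sorted(toy_cover))
```

On this toy chain the heuristic selects two of the four genes, and every similarity edge has at least one endpoint in the selected set. An exact minimum vertex cover is NP-hard in general, which is why a heuristic stands in for it here.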


Project number: 9042828
Grant type: GRF
Effective start/end date: 1/01/20 → …