Strain-level Composition Analysis for RNA Viruses

Project: Research



RNA viruses are responsible for many recent epidemics, such as SARS, Ebola, and COVID-19. Because of their error-prone replication processes, RNA viruses change constantly, which may lead to new strains with different biological functions. Thus, characterizing viral strains is crucial for studying RNA viruses.

To conduct strain-level composition analysis for RNA viruses, we propose to develop novel methods to characterize known and new strains in high-throughput sequencing datasets. Although a plethora of algorithms exists for strain-level analysis, a majority of them were designed for bacteria. Their reliance on bacteria-specific properties makes it difficult to repurpose them for viral strain analysis. In addition, there are unique challenges for viral strain characterization. First, as multi-strain infection is possible for RNA viruses, algorithms that can recover all strains are preferred. Second, many reference genomes share high sequence similarity, so identifying known strains requires algorithms that are both sensitive and fast enough to distinguish them.

To address these challenges, we will develop tools that capitalize on the properties of viral genomes and on novel applications of sequence classification and graph structures. First, we propose a k-mer set learning algorithm based on a novel formulation of the decision tree construction problem. Exploiting the main goal of decision tree construction algorithms, which is to reduce the expected number of tests, we will build a tree using k-mer matches as the tests and the reference genomes as the classes. This enables us to obtain k-mer sets that distinguish similar genomes with high accuracy and efficiency. Second, to the best of our knowledge, we are among the first to apply genome graphs, a compressible collapsed representation, to embed the strain genomes of viruses. By combining genome graphs with error-profile-incorporated sequence-to-graph alignment, we will pinpoint the correct strain in third-generation sequencing (TGS) data with high accuracy. Third, we will reconstruct new strain genomes using TGS data without relying on reference genomes. We will distinguish true variants from sequencing errors based on our key observation that sequencing errors heavily disrupt coding ability, which can be quantified using gene-finding hidden Markov models for short and gene-rich viral genomes. We thus propose a new graph search algorithm for new strain reconstruction. This method will enable us to recover multiple strains with higher accuracy than available error-correction and haplotype reconstruction tools.

The tools to be developed are expected to provide faster and more complete strain composition analysis than existing methods, thereby enabling high-resolution surveillance of viral strains.
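The first aim, a decision tree whose tests are k-mer matches and whose classes are reference genomes, can be illustrated with a small sketch. This is not the project's algorithm: the greedy "most balanced split" heuristic, the function names (`kmers`, `build_tree`, `classify`, `selected_kmers`), the toy sequences, and k = 4 are all assumptions chosen for illustration, and the sketch assumes a single error-free strain per sample.

```python
from dataclasses import dataclass
from typing import Dict, Set, Union


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


@dataclass
class Node:
    kmer: str          # the test: is this k-mer present in the sample?
    present: "Tree"    # subtree followed when the k-mer is found
    absent: "Tree"     # subtree followed when it is not


# A leaf is the set of genome names that no k-mer test can separate further.
Tree = Union[Node, frozenset]


def build_tree(genomes: Dict[str, Set[str]]) -> Tree:
    """Greedy heuristic (an assumption): split on the k-mer that divides
    the remaining genomes most evenly, which keeps the tree shallow."""
    names = sorted(genomes)
    if len(names) <= 1:
        return frozenset(names)
    best, best_gap = None, None
    for km in sorted(set().union(*genomes.values())):
        n_present = sum(km in genomes[n] for n in names)
        if n_present == 0 or n_present == len(names):
            continue  # this k-mer does not separate anything
        gap = abs(2 * n_present - len(names))
        if best_gap is None or gap < best_gap:
            best, best_gap = km, gap
    if best is None:
        return frozenset(names)  # genomes share identical k-mer sets
    present = {n: s for n, s in genomes.items() if best in s}
    absent = {n: s for n, s in genomes.items() if best not in s}
    return Node(best, build_tree(present), build_tree(absent))


def classify(tree: Tree, sample_kmers: Set[str]) -> frozenset:
    """Walk the tree using k-mer presence in the sample as each test."""
    while isinstance(tree, Node):
        tree = tree.present if tree.kmer in sample_kmers else tree.absent
    return tree


def selected_kmers(tree: Tree) -> Set[str]:
    """Collect the k-mers used as tests: the learned k-mer set."""
    if not isinstance(tree, Node):
        return set()
    return {tree.kmer} | selected_kmers(tree.present) | selected_kmers(tree.absent)


# Toy reference "genomes" (hypothetical sequences, not real viral data).
refs = {"A": "ACGTACGTAA", "B": "ACGTACGTCC", "C": "ACGTTTGTCC"}
K = 4
tree = build_tree({name: kmers(seq, K) for name, seq in refs.items()})
```

Calling `classify(tree, kmers(refs["B"], K))` walks at most two tests before reaching the leaf for strain B, and `selected_kmers(tree)` returns the small k-mer set the tree relies on. A practical method would also need to handle sequencing errors and mixed-strain samples, which this sketch ignores.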


Project number: 9043170
Grant type: GRF
Effective start/end date: 1/01/22 → …