Computational Methods for Accurate Reconstruction of RNA Viral Genomes Using Third-generation Sequencing Data

用三代測序數據精確重建 RNA病毒基因組的計算方法

Student thesis: Doctoral Thesis

Award date: 6 Mar 2024

Abstract

Ribonucleic acid (RNA) viruses are the most abundant group of parasites that infect both prokaryotic and eukaryotic cells. They are responsible for a wide range of diseases, including influenza, hepatitis C, dengue, and coronavirus disease 2019 (COVID-19). RNA viruses are therefore of great research interest. Because replication lacks proofreading mechanisms, many RNA viruses accumulate genetic changes (such as substitutions, insertions, and deletions), resulting in many viral variants and strains. While most variants are neutral, some lead to higher transmissibility or virulence. Obtaining accurate and complete viral genomes therefore enables the identification of pathogen variants and lays the foundation for research such as studying virus evolution and linking genotypes to phenotypes.

Whole-genome sequencing (WGS) is essential for characterizing viral variants and strains. Next-generation sequencing (NGS) provides quick and cost-effective access to DNA and RNA sequences, but its short reads make assembly challenging. Third-generation sequencing (TGS) platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) produce long reads, enabling the reconstruction of diverged or new strains. TGS also offers real-time and portable sequencing, which makes it suitable for remote areas. However, TGS has a higher per-base error rate than NGS, so distinguishing true variants from sequencing errors remains a challenge when assembling viral genomes from long reads.

This thesis introduces computational methods that leverage long reads to accurately reconstruct RNA virus genomes. First, we present a novel tool called AccuVIR, designed to assemble complete viral genomes from TGS-sequenced WGS data. Second, we introduce HMMPolish, a tool that uses profile Hidden Markov Models (pHMMs) to improve the accuracy of coding regions. The two methods serve different purposes. AccuVIR can be applied to both known and novel viruses because it does not require a reference genome of the targeted virus. HMMPolish can produce more accurate genomes, but it is designed to reconstruct variants of known viruses. Lastly, we provide an in-depth review of widely used pHMM databases and discuss their practicality in viral genome research.

Our first tool, AccuVIR, can be applied to known or novel viruses to produce a highly accurate viral genome. In the first phase, it builds a read alignment graph and applies the diverse beam search algorithm to that graph to generate high-quality candidate sequences. In the second phase, AccuVIR exploits the observation that sequencing errors can disrupt gene-finding outputs for viral genomes. Because RNA viral genomes are small and have a high density of coding regions, this disruption is more prominent in viruses than in other species. Based on this observation, we use mean reciprocal rank to select, as the final output, the candidate sequence with the fewest errors. We test AccuVIR on both simulated and real TGS data and benchmark it against popular TGS assembly and polishing tools. The results show that our tool can produce high-quality viral genomes for different RNA viruses and for sequencing data with different coverage and read lengths.
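The final selection step can be illustrated with a minimal sketch of mean reciprocal rank (MRR) aggregation. The abstract does not specify how AccuVIR derives its rankings from gene-finding outputs, so the three rankings below, the candidate names, and the helper functions `mean_reciprocal_rank` and `select_best` are all hypothetical; the sketch only shows how several best-first rankings of the same candidates can be combined into one choice.

```python
from collections import defaultdict


def mean_reciprocal_rank(rankings):
    """Return {candidate: MRR} for several best-first rankings.

    Each ranking is a list of candidate names ordered from best to worst.
    A candidate at position p (1-based) contributes 1/p for that ranking.
    """
    totals = defaultdict(float)
    for ranking in rankings:
        for position, candidate in enumerate(ranking, start=1):
            totals[candidate] += 1.0 / position
    return {cand: total / len(rankings) for cand, total in totals.items()}


def select_best(rankings):
    """Pick the candidate with the highest mean reciprocal rank."""
    scores = mean_reciprocal_rank(rankings)
    return max(scores, key=scores.get)


# Hypothetical rankings of three candidate sequences under three
# gene-finding-based criteria (assumed for illustration only).
rankings = [
    ["cand_b", "cand_a", "cand_c"],
    ["cand_b", "cand_c", "cand_a"],
    ["cand_a", "cand_b", "cand_c"],
]

scores = mean_reciprocal_rank(rankings)
best = select_best(rankings)
print(best, round(scores[best], 3))
```

Because reciprocal ranks decay quickly, a candidate that is ranked first by most criteria is favored over one that is consistently ranked in the middle.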

Apart from the assembly of the whole genome, the accurate reconstruction of specific proteins of RNA viruses is also of great interest. Understanding viral proteins and their functions is fundamental to many applications, such as designing vaccines, developing antiviral drugs, and suppressing antimicrobial resistance. For this purpose, we present a pipeline, HMMPolish, to correct (polish) the protein-coding regions of RNA viruses sequenced via TGS. This tool is designed for known viruses with an established repertoire of reference sequences. HMMPolish takes three inputs: the reads containing sequencing errors; a draft sequence, which is either a viral contig produced by a third-party assembly tool or the longest read; and a profile Hidden Markov Model (pHMM) of the coding region to be polished. The output is the viral sequence with polished protein-coding regions. By using pHMMs of protein families and domains, HMMPolish can correct errors that available polishers miss. We validate HMMPolish on 34 datasets covering four clinically important viruses: HIV-1, influenza A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platform (PacBio or Nanopore). Benchmarks against popular and representative polishers show that HMMPolish compares favorably on error correction in the coding regions of known RNA viruses.

A key component of HMMPolish is the profile Hidden Markov Model, which is widely used in bioinformatics to identify conserved domains and motifs within protein sequences with higher sensitivity than pairwise-comparison methods. Several pHMM databases exist, including some designed specifically for viral proteins and others that cover all species. However, the scope, focus, and training sequences of these databases can differ significantly. We therefore present a thorough review and evaluation of the commonly used pHMM databases, aiming to give researchers a comprehensive and critical assessment of their strengths and limitations. We also offer practical suggestions for using these databases in research on RNA viruses as well as other viruses.
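The intuition behind profile-based scoring can be sketched with a deliberately simplified model. The example below keeps only the match-state emission probabilities of a profile, which reduces it to a position-specific log-odds matrix; a full pHMM additionally has insert and delete states with transition probabilities. The toy alignment, the uniform background, the pseudocount, and the function names `build_profile` and `score` are assumptions for illustration and do not describe how HMMPolish or any pHMM database is implemented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped alignment.

    Scores are log(p_column(aa) / p_background(aa)) with a uniform
    background and additive pseudocounts (both are simplifying assumptions).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, query):
    """Sum the per-position log-odds of an ungapped query."""
    if len(query) != len(profile):
        raise ValueError("query length must match profile length")
    return sum(column[aa] for column, aa in zip(profile, query))


# Toy family alignment (hypothetical sequences).
family = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]
profile = build_profile(family)

member_score = score(profile, "MKVLA")     # matches the conserved columns
unrelated_score = score(profile, "WWWWW")  # residues never seen in the family
print(member_score > unrelated_score)
```

Because each column carries its own residue preferences learned from many family members, conserved positions weigh more heavily than variable ones, which is the property a single pairwise comparison lacks.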

In summary, we studied the limitations of using TGS data for viral genome reconstruction and proposed two different methods to tackle the challenge of removing sequencing errors from TGS data for RNA viruses. Our methods produce robust outputs and outperform other state-of-the-art tools across different experiments, laying the foundation for a better understanding of virus evolution and other research topics in RNA viruses.