Strain-level Metagenomic Analysis Using PacBio Sequencing

Project: Research


Third-generation sequencing platforms, such as the single-molecule, real-time (SMRT) sequencing developed by Pacific Biosciences, produce longer reads than second-generation sequencing technologies such as Illumina. Longer reads have great potential to distinguish different strains with higher accuracy. However, methods and tools for using long reads in metagenomic analysis have not been carefully explored, especially for strain-level analysis. Data-specific challenges, such as the high error rates and low coverage of long reads, must be addressed to take full advantage of the read length. The focus of this proposal is to deliver a suite of methods and tools for fast and accurate strain-level metagenomic analysis. We propose a gene-centric strain-level analysis that distinguishes genes from different strains. There are three key components: 1) fast protein family annotation by integrating deep learning with abundant protein sequence families; 2) error correction in classified reads using an augmented Viterbi algorithm; 3) strain reconstruction by distinguishing different coverage distributions.

Keywords: metagenomics; PacBio reads; error correction; homology search; strain-level analysis.
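The proposal does not describe how the Viterbi algorithm is augmented, so the following is only a minimal sketch of the standard, unaugmented algorithm on a toy two-state hidden Markov model. The states `A` and `B`, the symbols `x` and `y`, and all probabilities are hypothetical values chosen for illustration; they are not taken from the project. The sketch shows the basic idea behind model-based error correction: an isolated observation that disagrees with its neighbours is explained as noise rather than as a change of state.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for a discrete HMM (log-space)."""
    if not observations:
        raise ValueError("observations must be non-empty")
    # scores[s]: best log-probability of any state path ending in state s
    scores = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor of s at this step
            prev = max(states, key=lambda p: scores[p] + log_trans[p][s])
            new_scores[s] = scores[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


def to_log(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Hypothetical toy model: two "sticky" states, each favouring one symbol.
STATES = ["A", "B"]
LOG_START = to_log({"A": 0.5, "B": 0.5})
LOG_TRANS = to_log({"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}})
LOG_EMIT = to_log({"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}})

if __name__ == "__main__":
    # The lone "y" is absorbed as noise: the decoded path stays in state A.
    print(viterbi("xxyxx", STATES, LOG_START, LOG_TRANS, LOG_EMIT)[1])
    # A sustained run of "y" is decoded as a genuine switch to state B.
    print(viterbi("xxyyy", STATES, LOG_START, LOG_TRANS, LOG_EMIT)[1])
```

Working in log-space avoids numerical underflow on long sequences, which matters for reads spanning thousands of bases. A realistic setting would presumably replace the two toy states with a profile HMM of a protein family, but that choice is an assumption here, not something the proposal states.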


Project number: 7200620
Effective start/end date: 1/02/19 → …