Computational Methods for High-resolution Microbial Composition Analysis and Relevant Applications

高分辨率微生物組成分析的計算方法及其相關應用

Student thesis: Doctoral Thesis

Detail(s)

Award date: 12 Jan 2024

Abstract

Microbes comprise a wide range of microorganisms, including bacteria, viruses, fungi, and archaea. Their study has gained increasing attention in recent years, as it has become evident that complex microbial communities play important roles in human health, the environment, and biogeochemical cycling. High-throughput sequencing techniques have made it possible to characterize the microbial communities inhabiting different environments. One such technique, metagenomic sequencing, can sequence all DNA molecules in an input sample and thus provides an important data source for studying microbes. Another, metatranscriptomic sequencing, can sequence all RNA molecules from the microorganisms in a sample, which enables the detection and analysis of RNA viruses in diverse environments. Consequently, numerous methods exist for analyzing microbial composition from sequencing data. Although many of these methods are optimized for species-level composition analysis, there is still a pressing need for finer-grained analysis at the strain level. This level of resolution is essential because it enables the identification and characterization of specific strains with distinct phenotypic and genotypic properties. For example, different strains of the same virus can differ in virulence, transmissibility, and antigenicity, which can affect their pathogenicity and ability to evade the immune system. Similarly, different strains of the same bacterial species can differ in metabolic pathways, antibiotic resistance, and virulence factors, which can affect their interactions with the environment they inhabit. Identifying and characterizing these strains may therefore reveal important biological insights that are not apparent at the species level. However, the complexity of microbial data makes high-resolution composition analysis and interpretation challenging.
In this work, we developed novel algorithms for identifying and quantifying viral and bacterial strains, which is crucial for understanding the potential functional roles of individual microorganisms within a community. These tools output strain-level microbial compositional data, which complement species-level compositional data.

Viruses are the most abundant members of the microbial world, with numerous strains characterized by distinct properties. They are also known to have a significant impact on human health, agriculture, and the environment. Thus, we first aimed to identify viral strains in given samples. During replication, viruses undergo constant genetic changes, resulting in a high level of diversity within the population. Although many of these changes are synonymous, some can give the virus advantageous biological properties, such as enhanced adaptability. Additionally, viral genotypes are often associated with metadata, such as the host they reside in, which can help trace the transmission of viruses during pandemics. Subspecies-level analysis can therefore yield valuable insights into the characterization of viruses. Here, we developed VirStrain, a tool that takes short reads (next-generation sequencing data, or NGS data) as input and outputs the composition of viral strains. We tested VirStrain rigorously on multiple simulated and real virus sequencing datasets and found that it outperforms existing state-of-the-art tools in both sensitivity and accuracy.

Like viral strains, bacterial strains within the same species can exhibit different biological properties, which makes strain-level composition analysis a crucial step in understanding the dynamics of microbial communities. Metagenomic sequencing has emerged as the primary method for probing microbial composition in host-associated or environmental samples. However, existing composition analysis tools are not optimized for the challenges of strain-level analysis, such as the presence of multiple strains of one species in a sample and the need for a reference database containing highly similar strain genomes. To address these issues, we developed StrainScan, a new bacterial strain identification tool that uses a novel tree-based k-mer indexing structure to balance strain identification accuracy and computational complexity. We rigorously tested StrainScan on numerous simulated and real sequencing datasets and compared it with popular strain-level analysis tools. Our results demonstrate that StrainScan outperforms state-of-the-art tools in accuracy and resolution of strain-level composition analysis. Specifically, it improves the F1-score by 20% in identifying multiple strains with at least 99.89% average nucleotide identity.
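
To make the idea of k-mer-based strain identification and the F1-score concrete, the following is a minimal sketch on toy data. It uses a flat table of strain-specific k-mers, which is a deliberately simpler technique than the tree-based index described above, and it is not StrainScan's implementation. The function names, the toy genomes, the k-mer length `k=5`, and the `min_fraction` threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def strain_specific_kmers(references, k):
    """Map each strain name to the k-mers that occur in that strain only."""
    owners = defaultdict(set)
    for name, genome in references.items():
        for kmer in kmers(genome, k):
            owners[kmer].add(name)
    specific = {name: set() for name in references}
    for kmer, names in owners.items():
        if len(names) == 1:
            specific[next(iter(names))].add(kmer)
    return specific


def identify_strains(reads, references, k=5, min_fraction=0.5):
    """Report strains whose specific k-mers are mostly found in the reads."""
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    detected = set()
    for name, markers in strain_specific_kmers(references, k).items():
        # Strains without any specific k-mer cannot be called by this method.
        if markers and len(markers & read_kmers) / len(markers) >= min_fraction:
            detected.add(name)
    return detected


def f1_score(predicted, truth):
    """Harmonic mean of precision and recall over sets of strain names."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy genomes: three strains that differ by one SNP each.
references = {
    "strain_A": "ATGCGTACCTGCAGTCCGATTGCA",
    "strain_B": "ATGCGTGCCTGAAGTCCGATTGCA",
    "strain_C": "ATGCGTACCTGAAGTCCTATTGCA",
}
# Toy reads drawn from strain_A and strain_C only.
reads = ["GTACCTGCAGTC", "GAAGTCCTATTGCA"]

detected = identify_strains(reads, references)
print(sorted(detected), f1_score(detected, {"strain_A", "strain_C"}))
```

A flat table like this stops working when strains are so similar that few or no k-mers are unique to a single strain, and it grows with every genome added to the database. Those are the kinds of pressures that motivate a more structured index.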

Given the composition of these microorganisms in the sequencing data, a crucial question is how they affect human health. With advances in metagenomic sequencing technologies, numerous studies have investigated the associations between human gut microbes and various human diseases. These associations suggest that gut microbial compositional data can be used to differentiate case and control samples for specific diseases, a task known as host disease status classification. Learning-based models that distinguish disease and control samples are expected to identify important biomarkers more accurately than abundance-based statistical analysis. However, available tools have not fully addressed two challenges associated with this task: limited labeled microbial compositional data and decreased accuracy when a model is applied across studies. Confounding factors, such as diet and technical biases in sample collection and sequencing across different studies or cohorts, often limit how well a learned model generalizes. To overcome these challenges, we developed GDmicro, a new tool that combines semi-supervised learning and domain adaptation to produce a more generalizable model from limited labeled samples. We evaluated GDmicro on human gut microbial compositional data from 10 cohorts covering 5 different diseases and found that it outperforms state-of-the-art tools in performance and robustness. Specifically, it improves the AUC from 0.783 to 0.949 in identifying inflammatory bowel disease. Furthermore, GDmicro can identify potential biomarkers with greater accuracy than abundance-based statistical analysis methods and sheds light on their contribution to the host's disease status.
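
For readers unfamiliar with the AUC metric quoted above, the following sketch shows one standard way to compute it: as the probability that a randomly chosen case sample receives a higher predicted score than a randomly chosen control, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration. They are not data from the thesis, and this is not GDmicro's evaluation code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise case/control comparisons.

    labels: 1 for case (disease), 0 for control.
    scores: predicted disease scores, higher meaning more likely a case.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(cases) * len(controls))


# Hypothetical predictions for three cases and three controls.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

In a cross-study setting, the scores would come from a model trained on some cohorts and applied to a held-out cohort, so the AUC reflects how well the model generalizes beyond its training data.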

In this work, we developed a viral strain identification tool, VirStrain; a bacterial strain identification tool, StrainScan; and a microbiome-based host disease classification method, GDmicro. We applied these tools to large-scale real sequencing samples, including samples from pathogen-infected patients, individuals from different countries, and colorectal cancer cohorts. The experimental results demonstrate that these tools have the potential to enhance our understanding of microbes and to facilitate the development of diagnostic and therapeutic strategies for diseases.