Computational Methods for Strain-Level Microbiota Characterization


Student thesis: Doctoral Thesis





Award date: 5 Sept 2022


Strain-level variation in the human microbiota is essential to human health and disease. Strains of the same species can possess significantly different functional capacities, and many specific microbial strains have been proposed to be associated with human diseases. Moreover, horizontal gene transfer (HGT) reshapes the gene content of strains and affects bacterial functions, including antibiotic resistance and pathogenicity. However, microbiota characterization methods remain underdeveloped at the strain level. In this work, we developed novel computational methods to facilitate microbiota characterization at the strain level. Furthermore, we applied these methods to analyze cohorts of colorectal cancer (CRC), a type of cancer strongly associated with the human microbiota.

The human microbiota contains many highly related strains that are hard to distinguish. Genotype frequencies can be used to infer strain genotypes at a given variant locus. In addition, variants at different loci that are covered by the same reads provide evidence that they reside on the same strain. Nevertheless, no existing method can employ these two types of phase information simultaneously. Here, we developed PStrain, a strain profiling method that uses the definition of a second-order genotype frequency to integrate the phase information from sequencing reads and genotype frequencies. Benchmarking indicated that PStrain infers strain abundances and genotypes more accurately than state-of-the-art methods. After applying PStrain to CRC datasets, we found that a specific strain of Bacteroides coprocola is associated with CRC.
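
As a rough illustration of read-based phase information, the sketch below counts how often pairs of alleles at two loci are observed on the same read and normalizes the counts into joint frequencies. It is not PStrain's definition of second-order genotype frequency, which is not given here; the function name, the read representation (a mapping from locus position to allele), and the toy reads are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def pairwise_allele_frequencies(reads):
    """Joint allele frequencies for every pair of loci covered by one read.

    reads: list of dicts mapping locus position -> observed allele
           (a hypothetical, simplified read representation).
    Returns {(locus_a, locus_b): {(allele_a, allele_b): frequency}}.
    """
    counts = {}
    for read in reads:
        # Every pair of loci on the same read is one co-occurrence observation.
        for locus_a, locus_b in combinations(sorted(read), 2):
            pair_counts = counts.setdefault((locus_a, locus_b), Counter())
            pair_counts[(read[locus_a], read[locus_b])] += 1

    freqs = {}
    for pair, pair_counts in counts.items():
        total = sum(pair_counts.values())
        freqs[pair] = {alleles: n / total for alleles, n in pair_counts.items()}
    return freqs


# Toy reads: three support the A-G combination, one supports T-C.
reads = [
    {100: "A", 150: "G"},
    {100: "A", 150: "G"},
    {100: "T", 150: "C"},
    {100: "A", 150: "G", 210: "T"},
]
joint = pairwise_allele_frequencies(reads)
```

In this toy input, the alleles A at locus 100 and G at locus 150 co-occur on most reads, which is the kind of evidence that would suggest they belong to the same strain.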

Detecting HGT from metagenomic data is slow and computationally expensive. To address this problem, we developed LocalHGT, which rapidly detects HGT events using an approach similar to structural variation calling. LocalHGT accelerates the alignment step by using approximate k-mer matches to reduce the reference database to a small HGT-related reference. In the benchmark, LocalHGT ran more than four times faster than the traditional alignment-based tool with comparable accuracy, and it used less memory with the large reference database.
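
The idea of shrinking a reference database before alignment can be sketched as a simple k-mer prefilter. The example below uses exact k-mer set intersection as a stand-in for approximate k-mer matching, and it selects whole reference sequences rather than HGT-related regions, so it should be read as a simplified assumption and not as LocalHGT's algorithm. The function names, the parameters `k` and `min_shared`, and the toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def select_references(references, reads, k=5, min_shared=2):
    """Keep only references that share at least min_shared k-mers with the reads.

    references: dict mapping a reference name to its sequence.
    reads: list of read sequences.
    """
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)

    return [
        name
        for name, seq in references.items()
        if len(kmers(seq, k) & read_kmers) >= min_shared
    ]


# Toy database: only genome_a shares enough 5-mers with the reads.
references = {
    "genome_a": "ACGTACGTTAGC",
    "genome_b": "GGGGCCCCAAAA",
    "genome_c": "TTAGCACGTAC",
}
reads = ["CGTACGTTA", "TTTTTTTT"]
selected = select_references(references, reads)
```

Only the selected references would then be passed to the slower alignment step, which is where the time and memory savings would come from in a scheme of this kind.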

We next applied LocalHGT to eight CRC cohorts to systematically profile HGTs. We observed that the HGT network of CRC samples differs significantly from that of controls. The decrease in HGT events between members of the family Lachnospiraceae might promote colorectal carcinogenesis. We constructed a CRC classifier using HGT occurrence and species abundance as biomarkers. The classifier achieved an average area under the curve (AUC) of 0.87 in the leave-one-dataset-out analysis.
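
A leave-one-dataset-out analysis trains on all cohorts but one and evaluates on the held-out cohort, repeating this for each cohort. The sketch below shows that evaluation loop with a rank-based AUC. The classifier is a deliberately trivial single-feature scorer, since the model type used in the thesis is not stated here; the cohort names, feature values, and helper functions are all invented for illustration.

```python
def auc(labels, scores):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


def fit_direction(train):
    """Toy 'model': +1 if cases have the higher mean feature value, else -1."""
    cases = [x for x, y in train if y == 1]
    controls = [x for x, y in train if y == 0]
    case_mean = sum(cases) / len(cases)
    control_mean = sum(controls) / len(controls)
    return 1.0 if case_mean >= control_mean else -1.0


def leave_one_dataset_out(cohorts):
    """Hold out each cohort in turn, train on the rest, and report its AUC."""
    results = {}
    for held_out, test in cohorts.items():
        train = [s for name, samples in cohorts.items() if name != held_out
                 for s in samples]
        direction = fit_direction(train)
        labels = [y for _, y in test]
        scores = [direction * x for x, _ in test]
        results[held_out] = auc(labels, scores)
    return results


# Hypothetical cohorts: each sample is (feature value, label), 1 = CRC, 0 = control.
cohorts = {
    "cohort_a": [(0.9, 1), (0.8, 1), (0.2, 0), (0.3, 0)],
    "cohort_b": [(0.7, 1), (0.4, 1), (0.5, 0), (0.1, 0)],
    "cohort_c": [(0.6, 1), (0.9, 1), (0.3, 0), (0.2, 0)],
}
results = leave_one_dataset_out(cohorts)
mean_auc = sum(results.values()) / len(results)
```

Averaging the per-cohort AUCs, as in the last line, gives a single summary figure that reflects how well a model trained on some cohorts transfers to an unseen one.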

In this work, we developed PStrain, a strain profiling method, and LocalHGT, a fast HGT detection method. We applied the two approaches to analyze the gut microbiota at the strain level in CRC cohorts and obtained novel insights into CRC. Our results suggest that strain-level characterization methods can deepen our understanding of the human microbiota and facilitate the development of disease diagnosis and treatment.

    Research areas

  • metagenomics, Strain, HGT, CRC