Application of Bioinformatics Methods in Human Gut Microbiome Sequencing Data


Student thesis: Doctoral Thesis

Award date: 15 Jun 2021


Between 2013 and 2017, 12,900 publications on the human gut microbiome appeared, four-fifths of all publications on this topic in the last 40 years. Researchers have discovered relationships between the human gut microbiome and many diseases. Early studies focused on diseases directly related to the gastrointestinal (GI) tract, such as inflammatory bowel disease, liver disease, and colorectal cancer. Metabolic diseases then came into focus, as the GI tract plays an essential role in metabolism. Several studies showed how the gut microbiome is altered in patients with obesity and type 2 diabetes. Immune diseases such as rheumatoid arthritis, lupus erythematosus, myasthenia gravis, multiple sclerosis, and atherosclerosis also showed strong correlations with the gut microbiome. Taken together, current microbiome research highlights the importance of a healthy gut microbiome, as do the recently described gut-brain, gut-liver, and gut-lung axes, which make the gut a central organ for human health. Several approaches related to the gut microbiome have now been applied in clinical therapy, such as fecal microbiota transplantation (FMT), prebiotics, and dietary fiber intervention.

Early research on the human microbiome was mainly descriptive. The combination of next-generation sequencing technology and advanced bioinformatics has opened the door to revolutionary insight into the composition, function, and activity of the human gut microbiome. Current data and analysis pipelines are still under development and improvement. Despite continuous and significant technological advances in microbiology over the past decade, the research community has failed to set a consistent standard for microbiome research. Methodological progress has greatly advanced microbiology, yet there is still no perfect, universal method. A toolbox of complementary technologies can reduce the bias that each individual technology introduces and provide a more comprehensive understanding of biological systems as a whole. In turn, tools adopted to improve repeatability pay back the projects that use them. It is important to see that attempts to guard against threats to reproducibility, replicability, robustness, and generalizability are positive forces that will improve science. Here, we built a standard bioinformatics analysis pipeline and applied it to real metagenomic sequencing data. Further, we constructed a database of over 10,000 curated, analyzed metagenomic samples. With the Deepomics platform and the bio-Oviz framework, we implemented the Gut Microbiome Analysis Platform, which lets users analyze and visualize their own metagenomic data or compare them with filtered samples from our database.

We collected a set of analysis modules for human gut metagenomic data based on classic publications on this topic and applied them to a colorectal cancer (CRC) dataset. CRC causes high morbidity and mortality worldwide, and noninvasive gut microbiome (GM) biomarkers are promising for early CRC diagnosis. However, the GM varies significantly with ethnicity, diet, and living environment, suggesting that GM biomarker performance varies between regions. To identify GM biomarkers for CRC in Chongqing, China, we performed a metagenomic association analysis on stool samples from 52 patients and 55 corresponding healthy family members who lived with them. The GM of patients differed significantly from that of healthy controls. A total of 22 microbial genes were selected as screening biomarkers and showed high accuracy in an additional 46 cases and 40 randomly selected healthy adults in Chongqing (area under the receiver operating characteristic curve (AUC) = 0.905, 95% CI 0.832–0.977). The classifier based on these 22 biomarkers also performed well in cohorts from Hong Kong (AUC = 0.811, 95% CI 0.715–0.907) and France (AUC = 0.859, 95% CI 0.773–0.944). Quantitative PCR was used to measure three selected biomarkers for classifying CRC patients in an independent Chongqing population of 30 cases and 30 controls. The best biomarker, from Coprobacillus, achieved a high AUC (0.930, 95% CI 0.904–0.955). This study showed that our GM biomarkers have higher sensitivity and broader applicability than previous biomarkers, which significantly promotes the early diagnosis of CRC.
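For readers unfamiliar with the AUC values reported above, the following is a minimal sketch of how an area under the ROC curve can be computed from classifier scores. It uses the rank-based (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen case scores higher than a randomly chosen control. The function name `roc_auc` and the toy scores are illustrative assumptions, not the thesis's code or data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for cases (e.g. CRC), 0 for controls.
    scores: classifier output, higher means more case-like.
    """
    cases = [s for l, s in zip(labels, scores) if l == 1]
    controls = [s for l, s in zip(labels, scores) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for c in cases:
        for h in controls:
            if c > h:
                wins += 1.0      # case ranked above control
            elif c == h:
                wins += 0.5      # ties count as half
    return wins / (len(cases) * len(controls))


# Made-up toy scores, purely for illustration.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

The pairwise loop is quadratic in the number of samples, which is fine for cohorts of the size described here. Sorting-based implementations are preferable for large datasets.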

We consolidated the analysis modules used in the CRC dataset analysis, updated the methods, and constructed a metagenomics analysis pipeline, which has been implemented on the Deep Omics Analysis Platform (DOAP). The pipeline is divided into a basic analysis and an advanced analysis. In the basic analysis, the pipeline filters low-quality reads from the raw metagenomic sequencing data and outputs a quality control report for every sample. Subsequent steps generate taxonomic and functional profiles from the high-quality sequencing data. Based on these profiles, we can analyze alpha diversity, beta diversity, and richness, and perform PCA and PCoA. The output of the basic analysis is the collected feature profiles of the samples. We designed the advanced analysis for subject-specific studies, such as identifying which microbiome features differ between healthy individuals and CRC patients in the study above. Differential testing of feature profiles between groups reveals significantly different features. These features serve as input to enrichment network analysis and to a machine learning disease classifier. Enrichment network analysis clusters the features to reveal the core features. In our pipeline, we use a random forest to construct the disease classifier. Overall, our pipeline takes raw metagenomic sequencing data as input and generates taxonomic profiles, functional profiles, and diversity measures for each sample. Given group information, it can determine potential disease-related features and automatically construct a random forest disease classifier.
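As an illustration of the diversity step in the basic analysis, the sketch below computes richness, an alpha diversity index, and a beta diversity measure from taxonomic abundance vectors. The abstract does not state which indices the pipeline uses. The Shannon index and Bray-Curtis dissimilarity are assumed here as common choices, and the function names are hypothetical.

```python
import math


def richness(abundances):
    """Number of taxa observed (abundance > 0) in one sample."""
    return sum(1 for a in abundances if a > 0)


def shannon_index(abundances):
    """Shannon alpha diversity (natural log) of one sample."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("sample has no observed taxa")
    return -sum((a / total) * math.log(a / total)
                for a in abundances if a > 0)


def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two samples (beta diversity).

    Both vectors must list the same taxa in the same order.
    Returns 0 for identical profiles and 1 for no shared taxa.
    """
    if len(sample_a) != len(sample_b):
        raise ValueError("samples must cover the same taxa")
    denominator = sum(a + b for a, b in zip(sample_a, sample_b))
    if denominator == 0:
        raise ValueError("both samples are empty")
    numerator = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    return numerator / denominator
```

A pairwise matrix of `bray_curtis` values over all samples is the usual input to PCoA, while PCA operates directly on the abundance table.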

We applied our metagenomic data analysis pipeline to a myasthenia gravis (MG) dataset. MG is an acquired immune-mediated disorder of the neuromuscular junction that causes fluctuating skeletal muscle weakness and fatigue. Pediatric MG and adult MG differ in many characteristics, and current MG diagnostic methods are not well suited to children. Previous studies indicate that alterations in the gut microbiota may be associated with adult MG. However, it has not been determined whether the gut microbiota is altered in pediatric MG patients. Our study recruited 55 pediatric MG patients and 49 age- and gender-matched healthy controls (HC). We sequenced the fecal samples of the recruited individuals using whole-genome shotgun sequencing and analyzed the data with our in-house bioinformatics pipeline. We built an MG disease classifier based on the abundance of four biomarker species: Fusobacterium mortiferum, Prevotella stercorea, Prevotella copri, and Megamonas funiformis. The classifier achieved an area under the curve (AUC) of 94% in cross-validation and 80% in the independent validation cohort. Gut microbiome analysis revealed the presence of human adenovirus F/D in 10 MG patients. The pathways and gene families that differed significantly between MG patients and HC belonged to P. copri, Clostridium bartlettii, and Bacteroides massiliensis. Based on functional annotation, we found that the gut microbiome affects the production of short-chain fatty acids (SCFAs), and we confirmed the decrease in SCFA levels in pediatric MG patients via serum tests.

Besides cross-sectional sampling for analyzing the relationship between disease and the human gut microbiome, time-series sampling is another commonly used experimental design. Building on PStrain, the strain-level analysis tool we developed, we implemented a package, PStrain-tracer, to analyze time-series samples from FMT experiments. Fecal microbiota transplantation (FMT) may treat microbiome-associated diseases effectively. However, the mechanism and pattern of the FMT process remain to be elucidated. Previous studies indicated the need to track the FMT process at the microbial strain level, and shotgun metagenomic sequencing now enables us to study strain variation during FMT. The package analyzes microbial strain variation from shotgun metagenomic sequencing data, visualizes strain alterations, and traces microbial engraftment during the FMT process. We applied the package to two typical FMT datasets, an ulcerative colitis (UC) dataset and a Clostridium difficile infection (CDI) dataset. We observed that when an engrafted species has more than one strain in the source sample, 99.3% of such species engraft only a subset of their strains. Using heterozygous SNP counts, we further confirmed that the all-or-nothing model does not fit the engraftment of species with multiple strains, revealing that strains tend to engraft independently. Furthermore, we found that a primary determinant of a strain's engraftment success is its proportion within its species: on average, engrafted donor strains and engrafted pre-FMT strains had proportions 33.10% (p-value = 6e-06) and 37.08% (p-value = 9e-05) significantly higher than non-engrafted strains, respectively. All datasets indicated that, for twelve species, strain diversity surges after FMT and decreases to a single strain after eight weeks.
Previous studies overlooked strains whose species showed no significant difference between samples. With the package, in the UC dataset, we determined the strain variation of Roseburia intestinalis, a beneficial species that reduces intestinal inflammation: the strain that colonized the cured UC patient was engrafted from the donor, even though the patient already hosted the same species before treatment. In the CDI dataset, we found seven strains in donors and one strain in pre-FMT recipients, from eight species, that were associated with FMT failure in CDI. We thus demonstrated the necessity of analyzing whole-genome shotgun metagenomic data from FMT at the strain level. We also implemented a package to study FMT at the strain level and used it to uncover new knowledge about FMT.
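The idea of tracing strain engraftment can be sketched as follows. This is not the PStrain-tracer implementation or its API. It is a simplified illustration that assumes each sample is already summarized, for one species, as a mapping from strain identifier to within-species proportion. The function names, the `min_prop` presence threshold, and the example values are all hypothetical.

```python
def present_strains(profile, min_prop=0.01):
    """Strains whose within-species proportion reaches the threshold."""
    return {strain for strain, prop in profile.items() if prop >= min_prop}


def classify_post_fmt_strains(donor, pre_fmt, post_fmt, min_prop=0.01):
    """Label each post-FMT strain of one species by its likely source."""
    d = present_strains(donor, min_prop)
    r = present_strains(pre_fmt, min_prop)
    origins = {}
    for strain in sorted(present_strains(post_fmt, min_prop)):
        if strain in d and strain in r:
            origins[strain] = "shared"      # seen in donor and recipient
        elif strain in d:
            origins[strain] = "donor"       # engrafted from the donor
        elif strain in r:
            origins[strain] = "recipient"   # retained from before FMT
        else:
            origins[strain] = "novel"       # not detected in either source
    return origins


def is_partial_engraftment(donor, post_fmt, min_prop=0.01):
    """True when the donor carries several strains but only some engraft."""
    d = present_strains(donor, min_prop)
    engrafted = d & present_strains(post_fmt, min_prop)
    return len(d) > 1 and 0 < len(engrafted) < len(d)


# Made-up proportions for a single species, for illustration only.
donor = {"strain_1": 0.6, "strain_2": 0.4}
pre_fmt = {"strain_3": 1.0}
post_fmt = {"strain_1": 0.7, "strain_3": 0.3}
origins = classify_post_fmt_strains(donor, pre_fmt, post_fmt)
```

In this toy case only one of the two donor strains appears after FMT, so `is_partial_engraftment(donor, post_fmt)` is true, which corresponds to the subset engraftment pattern described above.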

Finally, we built an online Gut Microbiome Analysis Platform (GMAP). We collected whole-genome shotgun metagenomic sequencing data for 11,688 samples from 54 projects. These samples came from 15 countries and covered 29 diseases. To improve reusability and accessibility, we performed a curated analysis of these samples with the analysis process we built on DOAP to obtain the taxonomic profile of each sample. We also collected and integrated the metadata gathered by the curatedMeta database, the GMrepo database, and HMgDB, providing the most detailed metadata so far for the human gut metagenomic sequencing data we used. Based on these metadata, users can filter samples by keyword search in our database and add them to their personal online dataset. Both the taxonomic profiles and the metadata of this online dataset can be downloaded directly or used as part of the input to an analysis module. We have equipped DOAP with basic inter-group comparative analysis modules, including diversity, differential testing, enterotype, PCA, and PCoA. Users can upload their own taxonomic profiles as input or combine them with the dataset they filtered from our database for comparative analysis. We also provide online interactive visualization for all analysis modules, and users can adjust the visualizations online to generate figures for publication.
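The metadata-based sample filtering described above can be pictured with a small sketch. It does not reflect the actual GMAP interface or database schema. The field names (`sample_id`, `country`, `disease`), the function `filter_samples`, and the records are hypothetical.

```python
def filter_samples(records, **criteria):
    """Keep records whose metadata contain every requested keyword.

    Matching is a case-insensitive substring search, so
    disease="cancer" also matches "colorectal cancer".
    """
    def matches(record):
        for field, wanted in criteria.items():
            value = str(record.get(field, "")).lower()
            if str(wanted).lower() not in value:
                return False
        return True

    return [record for record in records if matches(record)]


# Made-up metadata records, for illustration only.
samples = [
    {"sample_id": "S1", "country": "China", "disease": "colorectal cancer"},
    {"sample_id": "S2", "country": "France", "disease": "healthy"},
    {"sample_id": "S3", "country": "China", "disease": "healthy"},
]
selected = filter_samples(samples, country="china", disease="healthy")
```

The selected records would then form the personal dataset whose taxonomic profiles feed the comparative analysis modules.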