Decoding Animal Gut Microbiomes through Advanced Bioinformatics Approaches and Disease Studies
基於高級生物信息學方法與疾病研究解碼動物腸道微生物組
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors | |
Award date | 27 Sept 2023 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(8755b45f-96f2-4977-8acc-674314dab651).html |
---|---|
Abstract
The gut microbiota, composed primarily of bacteria but also including fungi, archaea, and viruses, develops from birth and is shaped by delivery mode, diet, hygiene, and stomach acidity. It is essential for digestion, nutrient absorption, and immunity, and its dysbiosis can lead to diseases such as cancer. The gut virome, a vital component of the gut microbiome, influences microbial dynamics. High-throughput technologies such as next-generation sequencing (NGS) have deepened our understanding of the gut microbiome through marker gene and shotgun metagenomic sequencing. However, interpreting gut microbiome data remains challenging, from sequencing accuracy to genome reconstruction. In this thesis, we applied advanced bioinformatics approaches to four gut microbial studies and present new insights into the role of the gut microbiome in health and disease.
Feline panleukopenia (FPL), a highly contagious and frequently fatal disease of cats, is caused by Feline parvovirus (FPV) or Canine parvovirus (CPV). How FPV infection affects the composition of the gut virome in cats has been insufficiently characterized. We used metatranscriptomic and viral particle enrichment metagenomic approaches to characterize the gut viromes of 23 cats naturally infected with FPV (FPV-cases) and 36 age-matched healthy shelter cats (healthy controls). We found reduced gut virome diversity and a significant dissimilarity in viral composition following FPV infection. We identified 22 significantly differentially abundant genera in cats with FPV infection, including the enriched Kobuvirus and Protoparvovirus. Furthermore, contig-level taxonomy results showed that healthy control cats predominantly carried feline coronavirus, Mamastrovirus 2, and feline bocavirus, and were often co-infected with all three. The latter and feline bocaparvovirus 3 were notably more common in FPV-cases than in healthy controls. Additionally, we identified two novel feline astroviruses, termed Feline astrovirus 3 and 4. These viruses were isolated from three healthy shelter-housed kittens and from a kitten with diarrhea that was co-infected with FPV. The differences in gut virome composition revealed here indicate that further investigations are warranted to determine associations between enteric viral co-infections and clinical disease severity in cats with FPL.
Subsequently, we characterized the diversity of Carnivore protoparvovirus 1 (genus Protoparvovirus, family Parvoviridae) variants in 18 samples from the FPV-cases group using targeted parvoviral DNA metagenomics. All samples contained FPV only. Compared with the reference FPV genome, isolated in 1967, 44 mutations were detected. Ten of these were nonsynonymous, including nine in nonstructural genes and one in VP1/VP2 (Val232Ile), which was the only one to exhibit interhost diversity, being present in five sequences. There were five other polymorphic nucleotide positions, all with synonymous mutations. Intrahost diversity at all polymorphic positions was low, with subconsensus variant frequencies (SVF) of <1%, except for two positions in two samples with SVF of 1.1–1.3%. Intrahost nucleotide diversity was measured across the whole genome (0.7–1.5%) and for each gene, and was highest in the NS2 gene of four samples (1.2–1.9%). Overall, intrahost viral genetic diversity was limited and most mutations observed were synonymous, indicative of a low background mutation rate and strong selective constraints.
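To make the two intrahost measures above concrete, the following is a minimal Python sketch of how a subconsensus variant frequency and a per-site nucleotide diversity could be computed from per-position base counts. The abstract does not state which estimators or tools were used, so the function names, the choice of the most frequent non-consensus base as the subconsensus variant, and the pairwise-difference estimator of diversity are all assumptions made for illustration.

```python
from typing import Dict, List


def subconsensus_variant_frequency(counts: Dict[str, int]) -> float:
    """Frequency of the most common non-consensus base at one position.

    Assumption: "subconsensus variant" means the second most frequent base.
    """
    depth = sum(counts.values())
    if depth == 0:
        raise ValueError("no coverage at this position")
    ranked = sorted(counts.values(), reverse=True)
    minor = ranked[1] if len(ranked) > 1 else 0
    return minor / depth


def site_nucleotide_diversity(counts: Dict[str, int]) -> float:
    """Probability that two reads drawn without replacement differ at this site."""
    depth = sum(counts.values())
    if depth < 2:
        raise ValueError("need at least two reads to compare")
    identical_pairs = sum(n * (n - 1) for n in counts.values())
    return 1.0 - identical_pairs / (depth * (depth - 1))


def mean_nucleotide_diversity(sites: List[Dict[str, int]]) -> float:
    """Average per-site diversity over a region (gene or whole genome)."""
    if not sites:
        raise ValueError("no sites supplied")
    return sum(site_nucleotide_diversity(c) for c in sites) / len(sites)


# Hypothetical counts: one polymorphic site and one invariant site.
example_sites = [{"A": 990, "G": 10}, {"C": 1000}]
svf = subconsensus_variant_frequency(example_sites[0])
pi = mean_nucleotide_diversity(example_sites)
```

Here the first hypothetical site has a subconsensus variant frequency of 10/1000, which is at the 1% boundary reported above; the invariant site contributes zero diversity to the regional mean.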
Colorectal cancer (CRC) is the third leading cause of cancer-related death worldwide. Zearalenone (ZEA), a secondary metabolite of Fusarium fungi, has been shown to promote cancer in vitro. However, the lack of animal studies hinders a deeper mechanistic understanding of the cancer-promoting effects of ZEA. We conducted a study to determine the effect of ZEA on CRC progression and its underlying mechanisms. Through integrative analyses of transcriptomics, metabolomics, metagenomics, and host phenotypes, we found that a 4-week ZEA exposure tripled tumor weight and altered the tumor transcriptome, serum metabolome, and gut microbiota. ZEA exposure significantly increased the mRNA and protein levels of BEST4, DGKB, and Ki67 and the phosphorylation levels of ERK1/2 and AKT. Serum metabolomic analysis revealed that the levels of amino acids, including histidine, arginine, citrulline, and glycine, decreased significantly in the ZEA group. Furthermore, ZEA lowered the alpha diversity of the gut microbiota and reduced the abundance of nine genera, including Tuzzerella and Rikenella. Further association analysis indicated that Tuzzerella was negatively associated with the expression of the BEST4 and DGKB genes, serum uric acid levels, and tumor weight. Additionally, circulatory hippuric acid levels correlated positively with tumor weight and the expression of oncogenic genes, including ROBO3, JAK3, and BEST4. Our results indicate that ZEA promotes colon cancer progression by enhancing the BEST4/AKT/ERK1/2 pathway, lowering circulatory amino acid concentrations, altering gut microbiota composition, and suppressing short-chain fatty acid production.
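As an illustration of what a drop in alpha diversity means numerically, here is a small Python sketch of the Shannon index computed from per-taxon read counts. The abstract does not say which alpha diversity metric was used, so the choice of the Shannon index, the function name, and the count vectors are assumptions for illustration only.

```python
import math
from typing import Sequence


def shannon_index(counts: Sequence[int]) -> float:
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical genus-level counts: an even community and a dominated one.
even_community = [25, 25, 25, 25]
dominated_community = [97, 1, 1, 1]

h_even = shannon_index(even_community)
h_dominated = shannon_index(dominated_community)
```

With the same number of genera, the community dominated by a single genus scores lower than the even one, which is the kind of shift a reduced alpha diversity describes.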
In exploring the complexity of the gut virome and the impacts of ZEA interventions on colon cancer, it became clear that substantial portions of biodiversity remain undocumented, even in the most comprehensive sequence databases. Specifically, the gut virome study highlighted challenges in classifying numerous viral reads, and the ZEA study showed the importance of 16S rDNA sequence classification for accurately understanding microbial compositions. Such shortcomings often arise from over-reliance on conventional methods anchored to reference databases, especially when analyzing fecal samples. To bridge this knowledge gap, we developed MT-ALBERTax, a novel 16S rDNA sequence classifier based on the ALBERT transformer encoder with a self-attention mechanism. The model was improved with multi-task learning, which allowed it to train efficiently, identify complex patterns, and minimize overfitting by leveraging abstract knowledge across subtasks.
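The multi-task idea can be sketched as a single training loss that sums a classification loss over several subtasks sharing one encoder. The Python sketch below is not MT-ALBERTax itself: the abstract does not specify the subtasks, so treating them as taxonomic ranks (here "family" and "genus"), the task weights, and every function name are assumptions for illustration.

```python
import math
from typing import Dict, List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vector of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


def multitask_loss(
    logits_by_task: Dict[str, List[float]],
    targets: Dict[str, int],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum of per-task cross-entropy losses.

    Each task head would read the same shared sequence representation, so
    minimizing the sum pushes the encoder toward features useful to all tasks.
    """
    if weights is None:
        weights = {task: 1.0 for task in logits_by_task}
    return sum(
        weights[task] * cross_entropy(logits_by_task[task], targets[task])
        for task in logits_by_task
    )


# Hypothetical head outputs for one sequence: 2 families, 4 genera.
logits = {"family": [0.0, 0.0], "genus": [0.0, 0.0, 0.0, 0.0]}
labels = {"family": 0, "genus": 2}
loss = multitask_loss(logits, labels)
```

In this toy case both heads are uninformative, so the loss is ln 2 + ln 4; a head that learns the coarser task also lowers the error signal reaching the shared representation, which is one common argument for why multi-task training can limit overfitting.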
In developing MT-ALBERTax, we aimed to prevent data leakage by dividing the dataset under phylogenetic tree guidance, using varied train ratios and identity thresholds. The results revealed that higher train ratios and identity thresholds could artificially inflate the accuracy of the RDP-Custom classifier. On datasets where RDP-Custom's accuracy was below 50%, MT-ALBERTax consistently outperformed it, with a lead of up to 16% when trained with over 100 sequences per genus. Both the MT-ALBERTax and RDP-Custom classifiers showed optimal accuracy for sequences of around 450 bp aligned against the V3-V4 region reference. As sequences lengthened, MT-ALBERTax's advantage over RDP-Custom grew to 12%, and it widened to 20% with an increased Mean Training-Set Distance, highlighting its superior generalization. Nevertheless, the model remains specific to the V3-V4 genomic region, and its biological interpretation is inadequate. Future research aims to broaden the genomic range, offer deeper insights into MT-ALBERTax's working principles, and compare its performance with other models, notably BERTax.
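The effect of an identity threshold on leakage can be illustrated with a simplified Python sketch. It does not reproduce the phylogenetic-tree-guided split described above. Instead it swaps in a plain pairwise identity filter on pre-aligned, equal-length sequences, and the function names, toy sequences, and threshold value are assumptions for illustration.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_test_set(train: List[str], test: List[str], threshold: float) -> List[str]:
    """Keep test sequences whose best identity to any training sequence is
    below the threshold, so near-duplicates of training data are not scored."""
    if not train:
        raise ValueError("training set is empty")
    return [
        seq for seq in test
        if max(identity(seq, ref) for ref in train) < threshold
    ]


# Hypothetical toy data: identities to the training sequence are 1.0, 0.8, 0.2.
train_seqs = ["ACGTACGTAC"]
test_seqs = ["ACGTACGTAC", "ACGTACGTTT", "TTTTTTTTTT"]
held_out = filter_test_set(train_seqs, test_seqs, threshold=0.9)
```

Raising the threshold toward 1.0 lets near-identical sequences back into the test set, which is the situation in which a reference-based classifier's accuracy can look better than its true ability to generalize.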
The gut microbiota is critical for digestion, nutrient absorption, and immunity, and dysbiosis of the gut microbiota has been linked to numerous diseases. While high-throughput technologies such as NGS have deepened our insight into its role in disease, challenges persist due to data complexity. This thesis employed advanced bioinformatic approaches to investigate gut virome dynamics, particularly in FPL, and the implications of ZEA interventions for CRC. Notably, MT-ALBERTax, a novel taxonomic classification tool introduced here, demonstrated robust performance in deciphering the intricate patterns of 16S rDNA sequences. Future research will leverage multi-omics data and artificial intelligence (AI)-driven techniques to further understand and decode the relationships between the gut microbiome and disease.