Deep Learning in Metagenomics and Spatial Transcriptomics: Phenotype and Gene Inference
深度學習在宏基因組學和空間轉錄組學中的應用:表型和基因推斷
Student thesis: Doctoral Thesis
Detail(s)
Award date | 18 Sept 2023 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(53191829-b74a-4326-a60b-69a8db521a55).html |
Abstract
The development of high-throughput sequencing technologies has facilitated the generation of massive amounts of omics data, providing new insights into the composition, function, and interactions of biological systems at the molecular level. Consequently, new fields such as metagenomics and spatial transcriptomics have emerged. Metagenomics seeks to understand the structure, function, and diversity of entire microbial communities, as well as their interactions with the environment. Spatial transcriptomics, in turn, analyzes gene expression in its spatial context within specific cells or tissues. In this thesis, I aim to develop and evaluate deep learning-based approaches for two cutting-edge predictive tasks in omics data analysis: phenotype prediction and gene inference. The common theme of these methods is to learn how omics data relate to human phenotypes, such as disease and tissue morphology, in order to help identify potential biomarkers or genetic variations that can improve our understanding of human health and enable more targeted and personalized approaches to medicine.
Human phenotype prediction from microbiome data has broad implications in metagenomics. Existing methods rarely consider abundance profiles from both known and unknown microbial organisms or capture the taxonomic relationships among microbial taxa, which leads to significant information loss. Deep learning, for its part, has shown unprecedented advantages in classification tasks thanks to its feature-learning ability. However, these advantages do not carry over directly to microbiome-based disease prediction: high-dimensional, low-sample-size metagenomic datasets can lead to severe overfitting, and black-box models fail to provide biological explanations. To address these problems, we first developed MetaDR, a comprehensive machine learning-based framework that integrates multiple sources of information with deep learning to predict human diseases. Experimental results indicate that MetaDR achieves competitive prediction performance with reduced running time and effectively discovers informative features with biological insights. Secondly, given that microbiome composition measured at a single time point may not capture the dynamic relationships between temporal changes and human phenotypes, we proposed MicroGRU, another comprehensive deep learning-based framework that consists of well-designed data preparation strategies and a recurrent neural network and predicts human host status from longitudinal microbiome data. We evaluated the proposed framework on both semi-synthetic and real datasets based on different sequencing technologies and metagenomic contexts. The results indicate that our method achieves robust performance compared to baseline and state-of-the-art classifiers and provides a significant reduction in prediction time.
Lastly, we summarized the existing challenges in metagenomics and presented MicroTDA, a deep learning-based framework that takes advantage of the taxonomic annotations on the phylogenetic tree for feature engineering and leverages denoising autoencoders to predict host phenotype from human microbiome data. MicroTDA was evaluated on several metagenomic datasets from both 16S rRNA gene and shotgun metagenomic sequencing. The experimental results demonstrate its state-of-the-art performance and robust generalization.
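As a rough illustration of what taxonomy-aware feature engineering can look like, the following minimal sketch sums leaf-level abundances into every ancestor taxon, so that a model sees features at several taxonomic ranks. It is not the MicroTDA implementation, which the abstract does not describe in detail. The function name `aggregate_by_lineage`, the taxon labels, and the toy abundance values are all hypothetical.

```python
from collections import defaultdict


def aggregate_by_lineage(abundances, lineages):
    """Sum each leaf taxon's abundance into itself and all of its ancestors.

    abundances: dict mapping a leaf taxon to its relative abundance.
    lineages:   dict mapping a leaf taxon to its path from root to leaf
                (the last element is the leaf itself).
    """
    totals = defaultdict(float)
    for taxon, value in abundances.items():
        for node in lineages[taxon]:
            totals[node] += value
    return dict(totals)


# Hypothetical toy profile with three species under one phylum.
abundances = {"s__A": 0.5, "s__B": 0.25, "s__C": 0.25}
lineages = {
    "s__A": ["p__X", "g__G1", "s__A"],
    "s__B": ["p__X", "g__G1", "s__B"],
    "s__C": ["p__X", "g__G2", "s__C"],
}

features = aggregate_by_lineage(abundances, lineages)
for node in sorted(features):
    print(node, features[node])
```

In this toy case the genus `g__G1` receives the combined abundance of its two species. A downstream model, such as an autoencoder, could take the resulting multi-rank vector as input, but that wiring is an assumption here rather than something stated in the abstract.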
Turning to spatial transcriptomics, tissue context and molecular profiling are commonly used measures in understanding normal development and disease pathology. In recent years, the development of spatial molecular profiling technologies (e.g., spatial transcriptomics) has enabled the exploration of quantitative links between tissue morphology and gene expression. While these technologies preserve both cell morphological context and molecular characterization, they are costly and time-consuming. Furthermore, few studies have comprehensively evaluated the potential for extracting molecular features from tissue images. On this basis, we introduced HE2Gene, a deep learning-based method to predict gene expression from hematoxylin and eosin (H&E)-stained images. By leveraging multi-task learning and patch-based spatial dependencies, we successfully predict the expression of hundreds of target genes from breast tissue morphology. We also thoroughly evaluated our method on another diagnostic task and on spatial RNA-seq data with different technical contexts, and a variety of experiments demonstrate that HE2Gene is comparable to state-of-the-art methods.
Collectively, the proposed methods represent a promising contribution to the application of deep learning in metagenomics and spatial transcriptomics. By integrating advanced computational methods with cutting-edge sequencing technologies, our approaches offer significant advancements in understanding the complex relationships between microbiome composition and human phenotypes, as well as tissue morphology and gene expression patterns. These findings pave the way for future research and applications in personalized medicine, diagnostics, and targeted therapeutic interventions.