Machine Learning Approaches for Early Cancer Detection and Omics Data Analysis
基於機器學習的癌癥早篩及組學數據分析
Student thesis: Doctoral Thesis
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors | |
Award date | 17 Sept 2024 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(52eb0945-7096-4f5c-86c4-7f18989046d8).html |
---|---|
Abstract
Cancer remains a leading cause of death worldwide, with early detection playing a pivotal role in improving patient outcomes and reducing mortality rates. In recent years, the rapid advancement of high-throughput omics technologies, such as genomics, transcriptomics, proteomics, and metabolomics, has generated massive amounts of data with the potential to transform cancer diagnosis and treatment. Machine learning approaches have demonstrated promising results in extracting valuable information from omics data for early cancer detection and understanding the underlying molecular mechanisms. This thesis presents a comprehensive study on the development and application of machine learning methods, including metaheuristic optimization, traditional machine learning models, and deep learning models, to enhance early cancer detection and explore the generation and analysis of multi-omics data.
In this thesis, we review existing machine learning methods for early cancer detection and omics data analysis and discuss the challenges and opportunities they present. We also outline the steps involved in developing machine learning methods, including data preprocessing, model selection, model evaluation, and hypothesis testing.
For early cancer detection from genome-wide cell-free DNA, we propose an Adaptive Support Vector Machine (ASVM) that combines the Shuffled Frog Leaping Algorithm (SFLA) with a Support Vector Machine (SVM). Extensive experiments suggest that ASVM outperforms benchmark methods and can detect early cancer signals in complex genomic data.
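As a rough illustration of the metaheuristic named above, the following is a minimal, simplified sketch of a Shuffled Frog Leaping Algorithm written as a generic box-constrained minimizer. It is not the thesis's ASVM implementation: the function name `sfla_minimize`, its parameters, and the toy `sphere` objective are all hypothetical, and the usual maximum-step limit and submemeplex sampling are omitted for brevity. In an ASVM-like setup, one could assume the objective is something like the negative cross-validated accuracy of an SVM as a function of its hyperparameters, but that coupling is an assumption here.

```python
import random


def sfla_minimize(objective, bounds, n_frogs=30, n_memeplexes=5,
                  local_iters=5, n_shuffles=30, seed=0):
    """Minimize `objective` over a box with a simplified SFLA (illustrative only)."""
    if n_frogs < 2 * n_memeplexes:
        raise ValueError("each memeplex needs at least two frogs")
    rng = random.Random(seed)

    def random_frog():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def clip(position):
        return [min(max(x, lo), hi) for x, (lo, hi) in zip(position, bounds)]

    def leap(worst, target):
        # Move the worst frog a random fraction of the way toward the target frog.
        r = rng.random()
        return clip([w + r * (t - w) for w, t in zip(worst, target)])

    def by_cost(frog):
        return frog[0]

    # Each frog is stored as (cost, position); lower cost is better.
    frogs = [random_frog() for _ in range(n_frogs)]
    population = sorted(((objective(f), f) for f in frogs), key=by_cost)

    for _ in range(n_shuffles):
        # Deal the ranked frogs into memeplexes: frog i goes to memeplex i % m.
        memeplexes = [population[i::n_memeplexes] for i in range(n_memeplexes)]
        global_best = population[0]
        for memeplex in memeplexes:
            for _ in range(local_iters):
                memeplex.sort(key=by_cost)
                local_best, worst = memeplex[0], memeplex[-1]
                # Try a leap toward the local best, then toward the global best.
                for target in (local_best, global_best):
                    candidate = leap(worst[1], target[1])
                    cost = objective(candidate)
                    if cost < worst[0]:
                        break
                else:
                    # Neither leap improved the worst frog: replace it at random.
                    candidate = random_frog()
                    cost = objective(candidate)
                memeplex[-1] = (cost, candidate)
        # Shuffle: merge all memeplexes and re-rank the whole population.
        population = sorted((f for m in memeplexes for f in m), key=by_cost)

    return population[0]


def sphere(position):
    """Toy objective with its minimum at the origin (stand-in for a model loss)."""
    return sum(x * x for x in position)


best_cost, best_position = sfla_minimize(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
print(best_cost, best_position)
```

Because the best frog in each memeplex is never the one replaced, the best cost in the population cannot get worse from one shuffle to the next. Swapping `sphere` for a model-based score is the only change such a sketch would need to tune hyperparameters, at the price of one model fit per objective call.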
For early cancer detection from multi-modal biological features, we introduce AutoCancer, an automated multi-modal framework that uses metaheuristic optimization and a Transformer to perform feature selection, neural architecture search, and hyperparameter optimization automatically and simultaneously. Comparative experiments indicate that AutoCancer may deliver more robust early cancer detection than benchmark methods. Furthermore, AutoCancer was applied to identify key gene mutations and mutation combinations associated with non-small cell lung cancer (NSCLC), as well as to pinpoint crucial factors at different stages and subtypes.
For omics data generation and analysis, we propose scTranslator, the first pre-trained, context-aware, large-scale generative model for generating multi-omics data by translating the single-cell transcriptome to the proteome. We also investigate downstream applications of scTranslator, such as integrative regulatory inference, pseudo-knockout, cell marker analysis, cell clustering, and cell origin analysis based on pan-cancer data. These applications demonstrate the potential of scTranslator to improve our understanding of regulatory and interaction relationships at high resolution and to facilitate single-cell multi-omics analysis.
In conclusion, this thesis aims to contribute to the field of early cancer detection and omics data analysis by developing innovative machine learning methods that address the challenges posed by the complexity of cancer biology, the heterogeneity of cancer types, and the integration of diverse types of omics data.