Computational Methods for the Pathogenesis of Complex Diseases with Data Integration


Student thesis: Doctoral Thesis





Award date: 5 Sept 2023


Cancer is a leading cause of death worldwide, accounting for nearly ten million deaths in 2020, or almost one in six. Lung, gastric, and breast cancers are among the most prevalent. Many genetic and epigenetic changes lead to functional changes in the underlying pathology of cancer. The key to precision medicine lies in identifying candidate hallmarks and understanding the genetic and molecular pathways that underlie tumor heterogeneity.

Driven by high-throughput technological breakthroughs, the rapid accumulation of omics data has given rise to the concept of "big data" in cancer. Integrating and analyzing multi-omics data from microarray, RNA-seq, scRNA-seq, DNA methylation microarray, Hi-C, and other high-throughput technologies is crucial for elucidating pathological mechanisms and identifying promising biomarkers. Monitoring gene expression and transcriptome changes with RNA-seq helps in understanding tumor classification and progression. Abnormal methylation of tumor suppressor genes can inhibit their expression and lead to uncontrolled proliferation and invasion of tumor cells. Chromosomal instability caused by global DNA hypomethylation has been found in many tumor studies. Single-cell analysis reveals tumor cell heterogeneity by defining sub-populations and investigating changes in the immune environment. Emerging studies have shown that three-dimensional (3D) genome structure plays an essential role in many diseases associated with genetic variants located in non-coding regions of the genome. With 3D genome technology, we can study how disease-associated alterations in non-coding genomic regions affect chromatin interactions and gene expression.

A wide range of clinical evidence shows that tumors differ in gene regulation in ways that may be connected with their unique traits, so accurately identified prognostic biomarkers for each tumor type hold promise. We therefore identified co-activators by DNA co-binding motifs using MotifHub in prostate adenocarcinoma (PRAD), tumor immune microenvironment changes using scRNA-seq analysis in primary central nervous system lymphoma (PCNSL), prognosis-related alternative splicing (AS) events using a Cox regression model in bladder urothelial carcinoma (BLCA), DNA methylation-driven circulating tumor cells (CTCs) in colorectal cancer (CRC), immune-related lncRNA signatures in CRC, and a glycerolipid metabolism-associated risk model in hepatocellular carcinoma (HCC).

In this thesis, I focus on integrating multi-omics data to explore pathological mechanisms and pave the way for precision medicine. The content is divided into the following chapters:

Chapter 1: An introduction to computational biology development for integrating multi-omics data and the current biomarkers research in the diagnosis and prognosis of complex diseases, including cancers.
The first chapter introduces the timeline of the development of genomics, the concepts and related projects of precision medicine, and bioinformatics analysis for patient-specific medical research. It then discusses omics-based approaches to cancer biomarker identification, taking PRAD, BLCA, PCNSL, CRC, and HCC as examples, and summarizes their advantages and limitations.

Chapter 2: Discovering different Hi-C promoter-enhancer pair groups related to diseases by MotifHub.
The MotifHub algorithm is a novel method for identifying DNA motif groups in chromatin-interaction sequences (Hi-C data), and it can be applied to complex human disease datasets. The results suggest that transcription factor (TF) regulation patterns differ between cancer and normal samples. In this research, we identified co-binding TFs enriched in PRAD, consisting of FOXA1, HOXB12, and GATA3. The FOXA1-MYC motif pair groups were validated by expression correlation, mutual genomic alteration, co-binding motifs, and functional annotation analysis.

Chapter 3: The comprehensive and systematic identification of BLCA-specific SF-regulated, survival-related AS events.
BLCA is a complex disease with high morbidity and mortality. Changes in alternative splicing (AS) and splicing factors (SFs) can affect gene expression and thus play an essential role in tumorigenesis. We profiled the genes in which AS events occurred across cancer types, as well as the expression of five SFs in BLCA tumor and normal samples. We used CLIP-seq data to validate the interactions regulated by RNA-binding proteins (RBPs). Our study points to potential therapeutic targets for BLCA.

Chapter 4: Risk scoring based on DNA methylation-driven related DEGs for colorectal cancer prognosis with systematic insights.
CRC is the third most prevalent cancer worldwide. Its high patient mortality and poor prognosis need to be addressed urgently. In this study, we proposed a risk score model consisting of four risk factors based on DNA methylation-driven CTCs for assessing patient outcomes. The model performs well on an external dataset in terms of C-index, ROC, and tROC validation. It offers a new perspective on personalized treatment strategies for CRC patients and further promotes the development of precision medicine.
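As an illustration of the two ideas named above, the following is a minimal sketch of a linear risk score and of Harrell's C-index, written in plain Python. It is not the thesis's implementation: the gene names, coefficients, and patient values are hypothetical placeholders, the linear form of the score is an assumption (it is the usual form for Cox-derived models), and tied survival times are simply skipped for brevity.

```python
from itertools import combinations


def risk_score(expression, coefficients):
    """Linear risk score: sum of coefficient * expression over the model's genes."""
    return sum(coefficients[g] * expression[g] for g in coefficients)


def concordance_index(times, events, scores):
    """Harrell's C-index (simplified: pairs with tied times are skipped).

    A pair is comparable when the patient with the shorter follow-up time
    had an observed event. It is concordant when that patient also has the
    higher risk score; tied scores count as half.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue  # tied times, or the earlier patient was censored
        comparable += 1
        if scores[a] > scores[b]:
            concordant += 1.0
        elif scores[a] == scores[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical four-gene model and three hypothetical patients.
COEFFICIENTS = {"gene_a": 0.5, "gene_b": -0.2, "gene_c": 0.3, "gene_d": 0.1}
PATIENTS = [
    {"gene_a": 2.0, "gene_b": 1.0, "gene_c": 1.0, "gene_d": 0.0},
    {"gene_a": 1.0, "gene_b": 1.0, "gene_c": 0.0, "gene_d": 0.0},
    {"gene_a": 0.0, "gene_b": 2.0, "gene_c": 0.0, "gene_d": 1.0},
]
SCORES = [risk_score(p, COEFFICIENTS) for p in PATIENTS]
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why it is a common summary when a risk score is checked on an external dataset.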

Chapter 5: Construction of immune infiltration-related lncRNA signatures based on machine learning for the prognosis of colon cancer.
Molecular biomarkers play key roles in CRC prognosis. In particular, immune-related lncRNAs have attracted enormous interest in cancer diagnosis and treatment, but little is known about their potential roles. We aimed to investigate dysfunctional immune-associated lncRNAs and construct a risk model for distinguishing patient outcomes. By leveraging microarray, sequencing, and clinical data for immune cells and CRC patients, we identified housekeeping lncRNAs in 19 immune cell types using a cell type-specificity index. We proposed a risk score model with six lncRNAs that showed robust ROC performance on an independent dataset.
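The abstract does not say which cell type-specificity index was used. As one possible illustration, the sketch below computes the tau index, a widely used specificity measure; choosing tau here is an assumption, and the function name and input values are hypothetical. Values near 0 indicate broad, housekeeping-like expression across cell types, and values near 1 indicate expression restricted to a single cell type.

```python
def tau_index(expression):
    """Tau specificity index for one gene.

    `expression` holds one non-negative expression value per cell type.
    Returns a value in [0, 1], or None if the gene is not expressed anywhere
    (the index is undefined in that case).
    """
    n = len(expression)
    if n < 2:
        raise ValueError("need at least two cell types")
    if min(expression) < 0:
        raise ValueError("expression values must be non-negative")
    x_max = max(expression)
    if x_max == 0:
        return None
    # Average, over the other n - 1 cell types, of how far each falls below the maximum.
    return sum(1 - x / x_max for x in expression) / (n - 1)
```

Under this assumption, a lncRNA expressed at the same level in every cell type gives `tau_index([5, 5, 5, 5]) == 0.0`, and one expressed in a single cell type gives `tau_index([0, 0, 0, 8]) == 1.0`.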

Chapter 6: Identification of glycerolipid metabolism-associated multi-omics prognostic signatures in HCC.
HCC is one of the most common cancers worldwide and accounts for substantial morbidity and mortality. The increased glycerolipid metabolism demands of cancer cells underscore the importance of metabolic pathways for survival time. Multi-omics data were collected and analyzed to characterize alterations of glycerolipid metabolism (GMM)-associated genes at the mRNA, methylation, CNV, and somatic mutation levels. The analysis yielded GMMS as a candidate prognostic factor for HCC. Specifically, GMMS is related to cancer hallmarks and the tumor immune environment. We also identified drugs with GMMS-dependent sensitivity. The results suggest that glycerolipid metabolism disorders might appear in malignant cells.

Chapter 7: Conclusions and further work.
The last chapter discusses computational approaches for studying the pathogenesis of complex diseases and evaluating their prognosis with data integration.