Machine Learning and Deep Learning Methods for Dissecting the Molecular Heterogeneity of Cancer


Student thesis: Doctoral Thesis


Award date: 18 Jan 2024


It is now clear that major human cancers are complex and heterogeneous diseases. For instance, colorectal cancer (CRC) and ovarian cancer (OV), ranking as the third and fifth leading causes of cancer-related deaths worldwide, are highly heterogeneous cancers characterized by different therapeutic responses and prognostic phenotypes. Extensive studies have focused on identifying clinically relevant molecular subtypes to elucidate cancer heterogeneity. The development of high-throughput sequencing techniques has significantly improved the availability of multi-omics cancer data profiles. This advancement has facilitated the implementation of multi-omics cancer subtyping, enabling a holistic and systematic comprehension of the biological characteristics within tumors.

Cancer molecular subtyping requires high-quality omics profiles derived from tumor tissue specimens. One of the major tissue sources, formalin-fixed paraffin-embedded (FFPE) tissue, is widely available in the clinic. However, omics data derived from FFPE tissues often fail to meet this requirement because nucleic acids degrade during tissue fixation, embedding, and storage. Taking transcriptomic data as an example, RNA sequencing (RNA-seq) data derived from FFPE samples are distorted, resulting in poor quality that is not comparable to that of data obtained from fresh frozen (FF) tissues for molecular analysis of cancer. Therefore, it is imperative to recover transcriptomic next-generation sequencing (NGS) data derived from FFPE samples. Following these two motivations, my dissertation has been organized as follows:

In Chapter 1, I first presented a brief overview of the epidemiology and typical treatment of CRC and OV, and of progress in effective therapies such as targeted therapy and immunotherapy. I then introduced single-omics and multi-omics molecular subtyping in the two types of cancer, highlighting the significance of using multi-omics data to dissect cancer heterogeneity. Furthermore, the challenges associated with multi-omics cancer subtyping were discussed. While FF tissues are more commonly used for generating NGS data, FFPE tissues can serve as an alternative. The advantages and disadvantages of performing NGS data analysis on these two cancer tissue resources, FF and FFPE, were discussed. I emphasized the widespread availability and cost-effectiveness of FFPE cancer samples, as well as the challenges associated with RNA-seq and subsequent analyses using FFPE tissues. I finally summarized the research contents and objectives of this thesis.

In Chapter 2, I developed an integrated framework to perform CRC three-omics (mRNA expression, microRNA expression, and DNA methylation) subtyping using sparse multiple canonical correlation analysis (mCCA). This framework allowed for the identification of five distinct subtypes. Compared with the transcriptomic subtypes, the mesenchymal subtype could be further subdivided into two subgroups with distinct molecular and clinical features. Molecular biomarker analyses indicated that one of the mesenchymal subgroups could potentially benefit from immunotherapy. I demonstrated the generalization ability of the constructed multi-omics classifier in identifying biologically coherent and clinically relevant subgroups. Importantly, the framework can accept any combination of the three omics types as input for classification. It therefore offers a way to make efficient use of omics data beyond mRNA expression, such as microRNA expression and DNA methylation data, for CRC molecular subtyping.

Chapter 3 builds on the hypothesis that analyzing more types of omics data yields more information and contributes to a more comprehensive understanding of cancer heterogeneity. In this chapter, cancer heterogeneity was further investigated by incorporating two additional omics data types, copy number variation and somatic mutation. Using comprehensive five-omics data, I identified five distinct subtypes using similarity network fusion (SNF). Similar to the findings in Chapter 2, the transcriptome-based mesenchymal subtype was partitioned into two distinct subtypes. Notably, one of them was found to promote T cell exclusion and exhibited activated immunosuppressive activities. Furthermore, homologous recombination biomarker analysis revealed the potential of this subtype to benefit from poly(ADP-ribose) polymerase inhibitor therapy. I constructed a five-omics classifier using the integrated five-omics data with sparse mCCA and the clusters derived from SNF. The constructed classifier demonstrated robust performance in validation analysis and supported both single-omics and multi-omics classification.
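
For readers unfamiliar with SNF, the general idea can be illustrated with a short sketch. This is a simplified illustration of the published SNF scheme and not the implementation used in the thesis: the Gaussian kernel with a median-based bandwidth, the neighbor count `k`, the iteration count `iters`, and all function names are assumptions chosen for brevity. Each omics layer is turned into a patient-by-patient similarity matrix, and the matrices are then iteratively diffused through each other's nearest-neighbor graphs until they converge towards one fused network.

```python
import numpy as np


def affinity(X):
    """Gaussian affinity between samples (rows of X); bandwidth is an assumed heuristic."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    sigma = np.median(d2[d2 > 0])  # median squared distance as bandwidth
    return np.exp(-d2 / (2.0 * sigma))


def full_kernel(W):
    """Row-normalise: off-diagonal entries sum to 1/2, diagonal is fixed at 1/2."""
    P = W.copy()
    np.fill_diagonal(P, 0.0)
    P = P / (2.0 * P.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P


def local_kernel(W, k):
    """Keep each sample's k strongest neighbours (excluding itself), then row-normalise."""
    W0 = W.copy()
    np.fill_diagonal(W0, 0.0)
    idx = np.argsort(-W0, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S = np.zeros_like(W0)
    S[rows, idx] = W0[rows, idx]
    return S / S.sum(axis=1, keepdims=True)


def fuse(views, k=5, iters=20):
    """Fuse several sample-by-feature matrices (same samples, same order) into one network."""
    Ws = [affinity(X) for X in views]
    Ps = [full_kernel(W) for W in Ws]
    Ss = [local_kernel(W, k) for W in Ws]
    for _ in range(iters):
        updated = []
        for v in range(len(views)):
            # Average the current networks of all *other* views ...
            others = np.mean([Ps[u] for u in range(len(views)) if u != v], axis=0)
            # ... and diffuse that average through this view's local graph.
            P = Ss[v] @ others @ Ss[v].T
            updated.append(full_kernel((P + P.T) / 2.0))
        Ps = updated
    fused = np.mean(Ps, axis=0)
    return (fused + fused.T) / 2.0


if __name__ == "__main__":
    # Synthetic stand-in for two omics layers measured on the same 20 samples.
    rng = np.random.default_rng(0)
    shift = np.repeat([0.0, 5.0], 10)[:, None]
    view_a = rng.normal(size=(20, 5)) + shift
    view_b = rng.normal(size=(20, 8)) + shift
    print(fuse([view_a, view_b]).shape)
```

In practice the fused matrix would be passed to a clustering step (spectral clustering is the usual choice) to obtain subtype labels; that step, and the handling of discrete data types such as somatic mutations, is omitted here.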

Most FFPE cancer tissues are unsuitable for RNA-seq and related cancer research due to their suboptimal RNA quality. This problem was confirmed in Chapter 4 by evaluating the genome-wide transcript integrity number using FFPE-derived RNA-seq data from seven different cancer types and FF-FFPE matched RNA-seq data of CRC. To recover FFPE-derived RNA-seq data, I developed a deep learning (DL)-based framework using 9568 FF primary tumor samples from The Cancer Genome Atlas across 28 cancer types, together with the given FFPE dataset that required recovery. The framework enabled the reliable recovery of gene expression profiles in both simulated and real-world FFPE-sequenced cancer samples. The recovered gene expression profiles were more comparable with matched FF profiles, exhibited more relevant biological properties, achieved higher accuracy in molecular subtyping, and yielded more reasonable prognostic results. This study represents the first application of a DL model to rectify FFPE-derived RNA-seq data. Given its robustness across multiple cancer cohorts, this framework has the potential to recover gene expression profiles from different types of cancer, thereby greatly expanding the applicability of archived FFPE tissue samples and the invaluable retrospective information they carry.

In Chapter 5, I summarized the research contents and implications of this thesis and outlined future perspectives.

Taken together, the cancer omics research I have conducted during my Ph.D. study, using machine learning and DL, enhances our understanding of CRC and OV heterogeneity. It provides flexibility for cancer molecular subtyping when only a single omics type or a subset of omics types is available, and it enables fuller exploitation of archived FFPE tissue specimens. This work will provide insights into personalized medicine and contribute to the cancer research community.