Machine Learning for Tumor Heterogeneity Analysis and Drug Efficacy Inference

Student thesis: Doctoral Thesis

Abstract

Deciphering the intricate characteristics of cancer is of significant research and clinical value, especially for tumor heterogeneity analysis, tumor subtyping, survival prognostication, and treatment efficacy inference. In tumor subtyping, most studies have used only a single classification method, which may prevent a more refined tumor classification. Several classification methods are in common use, including risk scoring based on the expression of prognosis-related signature genes, non-negative matrix factorization (NMF) clustering, and consensus clustering. Therefore, in the first topic, I combined all three strategies on two independent head and neck squamous cell carcinoma (HNSC) datasets, which contain transcriptomic profiles and well-annotated metadata, to capture more refined biological characteristics of HNSC. Interestingly, we reproducibly observed that around one-third of patients with high tumor immunity have high risk scores, contradicting previous reports that high tumor immunity is associated with a favorable prognosis. We further matched the signature gene expression profiles of different patient subpopulations to drug-induced cell line transcriptomes to evaluate the association between tumor subtyping and treatment efficacy. We identified a subset of patients with poor prognosis (the SigH_IciA subset) who are likely to show elevated sensitivity to drug treatments that reverse the differential expression of signature genes.
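
As an illustration of the first of these strategies, signature-based risk scoring is commonly computed as a weighted sum of signature gene expression values, with patients then split into high-risk and low-risk groups at the median score. The following is a minimal sketch under that assumption, not the thesis's actual pipeline: the gene names, coefficients, patient identifiers, and the median cutoff are all hypothetical.

```python
from statistics import median


def risk_scores(expression, coefficients):
    """Return one risk score per patient: sum of coefficient * expression."""
    scores = {}
    for patient, profile in expression.items():
        scores[patient] = sum(
            coef * profile[gene] for gene, coef in coefficients.items()
        )
    return scores


def split_by_median(scores):
    """Label each patient 'high' if the score exceeds the cohort median, else 'low'."""
    cutoff = median(scores.values())
    return {p: ("high" if s > cutoff else "low") for p, s in scores.items()}


# Hypothetical signature coefficients (e.g. from a Cox regression) and toy data.
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}
expression = {
    "P1": {"GENE_A": 4.0, "GENE_B": 2.0},
    "P2": {"GENE_A": 1.0, "GENE_B": 4.0},
    "P3": {"GENE_A": 2.0, "GENE_B": 0.0},
    "P4": {"GENE_A": 0.0, "GENE_B": 2.0},
}

scores = risk_scores(expression, coefficients)
groups = split_by_median(scores)
print(groups)
```

In a hybrid scheme such as the one described above, the resulting risk labels could then be cross-tabulated with cluster labels from NMF or consensus clustering to define finer patient subsets.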

The comparison between RNA and protein levels of candidate target genes in my first project is consistent with previous reports that RNA expression and protein quantification are not strongly correlated at the genome-wide level. Many past studies were based on genomic mutations and RNA-seq data, which are well annotated in public pharmacogenomic databases that are rich in corresponding cell viability assays of drug response. However, pharmacogenomic datasets with protein quantifications are relatively scarce and poorly studied. With the availability of the recently enriched proteomic dataset ProCan-DepMapSanger, we systematically evaluated the interplay among genomic mutations, transcription, and protein expression across cancer cell lines. We integrated the proteomic map with the chemical features of drug molecules to construct a Bi-modal Drug Response Network (BDRN) that infers the drug sensitivities of cancer cell lines. We found that protein quantifications slightly improve drug response prediction performance compared to a model trained on transcriptome profiles. To identify cancer genes (defined as mutated genes that are causally implicated in oncogenesis) and putative therapeutic target genes at the protein level, we compared differentially expressed genes (DEGs) at the RNA level with DEGs at the protein level in terms of their overlap with the 1102 cancer genes annotated in the expert-guided OncoKB database. We found that protein-level DEGs overlapped more with OncoKB-listed cancer genes.
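
The overlap comparison described above can be sketched in a few lines. The abstract does not state which overlap metric was used, so the overlap fraction and the one-sided hypergeometric enrichment test below are assumptions chosen for illustration, and all gene identifiers and set sizes are toy values rather than real OncoKB entries.

```python
from math import comb


def overlap_fraction(degs, cancer_genes):
    """Fraction of the DEG set that also appears in the cancer gene list."""
    degs = set(degs)
    return len(degs & set(cancer_genes)) / len(degs)


def hypergeom_pvalue(k, n_background, n_cancer, n_degs):
    """P(X >= k): chance of seeing at least k cancer genes among n_degs genes
    drawn at random, without replacement, from the background."""
    total = comb(n_background, n_degs)
    upper = min(n_cancer, n_degs)
    tail = sum(
        comb(n_cancer, i) * comb(n_background - n_cancer, n_degs - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Toy example: 10 background genes, 4 of which are on the cancer gene list.
cancer_genes = {"G1", "G2", "G3", "G4"}
rna_degs = {"G1", "G5", "G6", "G7"}
protein_degs = {"G1", "G2", "G3", "G8"}

rna_frac = overlap_fraction(rna_degs, cancer_genes)
protein_frac = overlap_fraction(protein_degs, cancer_genes)
protein_p = hypergeom_pvalue(3, n_background=10, n_cancer=4, n_degs=4)
print(rna_frac, protein_frac, protein_p)
```

Reporting an enrichment p-value alongside the raw fraction helps when the two DEG sets differ in size, since a larger set can overlap more by chance alone.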

Bulk sequencing technologies characterize the averaged gene expression levels (whether transcriptomic or proteomic) of a mixture of heterogeneous cells, concealing the actual diversity of individual cells. In contrast, single-cell RNA sequencing (scRNA-seq) quantifies gene expression at cellular resolution, enabling us to characterize diverse subpopulations of cancer cells (also known as intratumor heterogeneity), a well-known issue that can lead to anti-cancer treatment failure. However, pharmacogenomic information matched to scRNA-seq data is often limited. To address this, we constructed a transfer learning model that predicts the drug sensitivity of Single Cells by integrating Adversarial Domain adaptation (SCAD) to alleviate the unfavorable effects of dataset bias. Compared to two baseline models without domain adaptation, models with transfer learning achieved better drug response prediction performance. In addition, our results suggest that knowledge learned from pre-clinical cell lines can be used to infer pre-existing cell subpopulations with different drug sensitivities prior to drug exposure. Furthermore, our model offers a new perspective on drug combinations. We also adopted the ML model explainer IntegratedGradients to identify promising biomarkers. The identified drug sensitivity biomarkers provide insights into tumor heterogeneity and treatment at cellular resolution.
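
To make the attribution step concrete, the Integrated Gradients method assigns each input feature the product of its distance from a baseline and the average model gradient along the straight path from baseline to input. The sketch below approximates that integral with a midpoint Riemann sum. It is not the thesis's implementation: a small logistic model stands in for the trained network, and the weights, input, and all-zero baseline are hypothetical.

```python
import math


def predict(x, weights, bias):
    """Logistic model standing in for a trained drug response predictor."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x, weights, bias):
    """Analytic gradient of the logistic output with respect to each input."""
    p = predict(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = gradient(point, weights, bias)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the input's distance from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical model parameters and one toy expression profile (3 genes).
weights = [1.0, -2.0, 0.5]
bias = 0.0
x = [2.0, 0.5, 0.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights, bias)
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline. Features with large absolute attributions are the candidates one would inspect as potential biomarkers.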

Overall, we adopted three machine learning frameworks to infer tumor treatment efficacy and cancer prognosis across different omics features. We hope that our work can provide a reference for the precision medicine research community.

Date of Award: 27 Aug 2024
Original language: English
Awarding Institution
  • City University of Hong Kong
Supervisor: Ka Chun WONG
