Pattern Recognition on Biomolecular Data: RNA G-quadruplex Search, 2'-O-methylation Site Identification, and Cancer Detection
基於生物分子數據的模式識別:RNA G-四聯體搜尋, 2'-O-甲基化修飾位點識別, 以及癌症檢測
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors | |
Award date | 11 Sept 2023 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(13b45192-46f8-4f44-96db-b3a899640a94).html |
---|---|
Abstract
Pattern recognition is widely used in biomolecular data analysis, including sequence analysis, structure folding, disease diagnostics, and biomarker discovery. Biomolecular data such as DNA (genome), RNA (transcriptome), proteins (proteome), and metabolites (metabolome) serve as fingerprints for identifying patterns, predicting functions, diagnosing diseases, and exploring biological interpretations. In this thesis, we aim to identify modification sites and detect cancers based on molecular data.
In the first topic, we propose a method to identify RNA G-quadruplex (rG4) secondary structures in mRNA. Guanine (G)-rich sequences in RNA can fold into diverse rG4 structures that mediate various biological functions and cellular processes in eukaryotic organisms. However, the presence, locations, and functions of rG4s in prokaryotes remain elusive. We used QUMA-1, an rG4-specific fluorescent probe, to detect rG4 structures in a wide range of bacterial species both in vitro and in vivo, and found rG4 to be an abundant RNA secondary structure across those species. Subsequently, to identify bacterial rG4 sites in the transcriptome, the model Escherichia coli strain and a major human pathogen, Pseudomonas aeruginosa, were subjected to the recently developed high-throughput rG4 structure sequencing (rG4-seq). In total, 168 and 161 in vitro rG4 sites were found in E. coli and P. aeruginosa, respectively. The genes with rG4 sites were found to be involved in virulence, gene regulation, cell envelope synthesis, and metabolism. More importantly, biophysical assays revealed the formation of a group of rG4 sites in mRNAs (such as hemL and bswR), and these sites were functionally validated in cells by genetic (point mutation and lux reporter assays) and phenotypic experiments, providing substantial evidence for the formation and function of rG4s in bacteria. Overall, the proposed method uncovers important regulatory functions of rG4s in bacterial pathogenicity and metabolic pathways, and strongly suggests that rG4s exist and can be detected in a wide range of bacterial species.
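As a purely illustrative companion to the rG4 search described above, the following Python sketch scans an RNA sequence for the widely used canonical G-quadruplex motif (four tracts of at least three guanines separated by loops of 1 to 7 nucleotides). It is not the rG4-seq pipeline used in the thesis, which is an experimental sequencing method; the function name `find_canonical_rg4` and the loop-length bounds are assumptions chosen for this example.

```python
import re

# Canonical putative quadruplex motif: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+.
# The pattern is wrapped in a lookahead so that overlapping candidates
# starting at different positions are all reported.
RG4_PATTERN = re.compile(
    r"(?=(G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}[ACGU]{1,7}G{3,}))"
)


def find_canonical_rg4(rna):
    """Return (0-based start, matched sequence) for each candidate rG4 site.

    Hypothetical helper for illustration only: a sequence-level match says
    nothing about whether the structure actually folds in vitro or in vivo.
    """
    # Accept DNA-style input by converting T to U.
    rna = rna.upper().replace("T", "U")
    return [(m.start(), m.group(1)) for m in RG4_PATTERN.finditer(rna)]


if __name__ == "__main__":
    for start, motif in find_canonical_rg4("AUGGGAGGGCGGGUGGGAC"):
        print(start, motif)
```

A motif scan of this kind only nominates candidate sites; experimental evidence such as the rG4-seq and reporter assays described in the abstract is still needed to confirm formation and function.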
In the second topic, we propose a method to identify 2′-O-methylation (Nm) sites. Nm is an abundant modification in all kinds of RNA species in eukaryotic cells. Although Nm is present in the ribosomal RNA of bacteria, its occurrence in bacterial mRNA remains elusive. Here, we used a newly developed high-throughput Nm sequencing method (Nm-seq) to identify Nm sites in mRNA from two model bacterial species (Escherichia coli and Pseudomonas aeruginosa). In total, we identified 163 Nm sites in E. coli mRNA and 300 Nm sites in P. aeruginosa mRNA, suggesting that Nm modification is abundant in these two bacteria. Our results revealed a distinct Nm distribution pattern in bacterial mRNA. We found that more than half of the Nm sites were located in the CDS and occurred in one gene with a 10-bp conserved motif (NmGACUGUGANA). Nm sites showed different base and codon preferences between E. coli and P. aeruginosa. More Nm sites were present at the second base of a codon than at the other positions. Nm-modified genes played potential functional roles in oxidation-reduction, transcriptional regulation, and transmembrane transport. Importantly, a group of Nm sites is present in virulence genes (such as lasI, gacA, tssA1, vfr, vqsR, amrZ, and kinB) in P. aeruginosa. Taken together, this study uncovered hundreds of Nm sites in bacterial mRNA with important biological roles.
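The codon-position observation above can be made concrete with a small Python sketch that maps a modification site to the first, second, or third base of its codon and tallies the result. The coordinate convention (0-based, half-open CDS intervals), the strand handling, and the function names `codon_position` and `tally_codon_positions` are assumptions for illustration, not the analysis code of the thesis.

```python
from collections import Counter


def codon_position(site, cds_start, cds_end, strand="+"):
    """Return 1, 2, or 3 for a site inside a CDS, or None if it lies outside.

    Assumes 0-based coordinates and a half-open interval [cds_start, cds_end)
    whose length is a multiple of three (hypothetical convention).
    """
    if not (cds_start <= site < cds_end):
        return None
    # On the minus strand, the first coding base is the last genomic base.
    offset = site - cds_start if strand == "+" else (cds_end - 1) - site
    return offset % 3 + 1


def tally_codon_positions(sites):
    """Count sites per codon position; sites outside their CDS are skipped.

    Each item is a tuple (site, cds_start, cds_end, strand).
    """
    counts = Counter()
    for site, cds_start, cds_end, strand in sites:
        position = codon_position(site, cds_start, cds_end, strand)
        if position is not None:
            counts[position] += 1
    return counts


if __name__ == "__main__":
    example_sites = [(101, 100, 400, "+"), (398, 100, 400, "-"), (50, 100, 400, "+")]
    print(tally_codon_positions(example_sites))
```

Comparing the three counts against an even one-third split is one simple way to check whether a particular codon position is over-represented.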
In the third topic, we propose a deep transfer learning model to localize cancers based on blood molecules. Cancer is prevalent, with high mortality across different organs and tissues. Although late-stage cancer remains difficult to treat, early-stage cancer normally has a higher survival rate than its late-stage counterpart. Recently, computational models have been proposed for cancer detection and have achieved promising performance. However, existing methods perform poorly at cancer localization. For example, CancerA1DE improves the AUC of binary cancer detection from 89% to 99%, whereas its AUC for cancer type localization is merely 83%. Therefore, there is a pressing need for an efficient model that precisely locates different cancers. We introduce CancerTL, a deep learning-based model for cancer localization. Specifically, the model leverages transfer learning to extract prior knowledge from a source cancer pair to improve the localization of a target cancer pair. Experiments indicate that CancerTL significantly outperforms the corresponding deep learning model with the same neural architecture (CancerDL), achieving an average AUC of 97.61%. To further explain and scale our model, we interpreted the results via feature importance analysis with the Cancer-Biomarker network to visualize and explain the effect of knowledge transfer in CancerTL. Lastly, all of the selected binary classification models (CancerTL) were integrated into a multiple cancer-type classifier (Integrated CancerTL) for the final evaluation. The integrated classifier achieves an average AUC of 92%, outperforming current state-of-the-art models such as CancerSEEK (78%) and CancerA1DE (83%).
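The abstract does not state how the pairwise binary models are combined, so the following Python sketch shows only one plausible scheme under stated assumptions: each binary model scores a pair of cancer types, and the per-type probabilities are summed across all pairs (soft one-vs-one voting). The function name `integrate_pairwise`, the input format, and the cancer type labels are hypothetical and are not taken from Integrated CancerTL.

```python
from collections import defaultdict


def integrate_pairwise(pairwise_probs):
    """Combine pairwise binary outputs into one multi-class prediction.

    pairwise_probs maps (type_a, type_b) to the probability that the sample
    belongs to type_a rather than type_b, as produced by a binary model for
    that pair. Soft voting is an assumption made for this sketch.
    """
    if not pairwise_probs:
        raise ValueError("at least one pairwise prediction is required")
    scores = defaultdict(float)
    for (type_a, type_b), prob_a in pairwise_probs.items():
        if not 0.0 <= prob_a <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        scores[type_a] += prob_a
        scores[type_b] += 1.0 - prob_a
    # Sort the labels first so that ties are broken deterministically.
    best = max(sorted(scores), key=scores.get)
    return best, dict(scores)


if __name__ == "__main__":
    example = {
        ("liver", "lung"): 0.9,
        ("liver", "ovary"): 0.8,
        ("lung", "ovary"): 0.4,
    }
    print(integrate_pairwise(example))
```

Soft voting keeps the confidence of each binary model rather than reducing it to a hard vote, which matters when several pairwise models are close to undecided.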