Abstract
Ribonucleic acid (RNA) is a class of single-stranded linear macromolecules formed by the polymerization of ribonucleotides through 3',5'-phosphodiester bonds and built from four bases: adenine (A), uracil (U), guanine (G), and cytosine (C). According to the central dogma, DNA is transcribed into RNA, and RNA is translated into protein. However, only a minority of RNAs are translated into proteins. Based on whether they are translated, RNAs can be divided into protein-coding RNAs (mRNAs) and non-coding RNAs (ncRNAs), and ncRNAs can be further classified into microRNAs (miRNAs), long non-coding RNAs (lncRNAs), circRNAs, piRNAs, and other types. Although ncRNAs are not translated into proteins, they play essential roles in cellular life activities. For example, miRNAs participate in post-transcriptional gene regulation to modulate target gene expression, which in turn affects disease occurrence and development. LncRNAs have various regulatory modes; in one of them, they act as competing endogenous RNAs (ceRNAs) and regulate the expression of their target genes by binding to miRNAs. Additionally, lncRNAs are involved in epigenetic regulation, cell cycle regulation, cell differentiation, and immune responses, and are closely related to the development of autoimmune diseases through different mechanisms.
Although ncRNAs do not directly participate in protein generation, they can regulate the expression of related upstream and downstream genes. Therefore, identifying ncRNA targets could be meaningful in slowing disease progression and finding potential disease treatment strategies.
Wet lab validation can accurately reveal the relationships and mechanisms of action of small biological molecules, including ncRNAs. However, wet lab experiments have disadvantages, including expensive consumables, long experimental timelines, and the need for many personnel to follow up. In silico experiments conducted through computational models or simulations offer a promising and efficient alternative. Machine learning-based methods can perform large-scale virtual screening of ncRNAs and targets, thereby improving the efficiency of discovering relationships and of validating them through wet lab experiments.
Because RNAs play crucial roles in living organisms, accurately predicting ncRNAs and their potential targets is of great importance. However, compared to double-stranded DNA, the single-stranded structure of RNA makes its spatial structure more complex and variable, rendering the prediction of RNA-target relationships more challenging. Traditional methods for predicting RNA targets, such as calculating molecular free energy or using simple sequence-based statistical features to recommend potential targets, suffer from limitations including poor prediction performance and high false-positive rates. With the development of deep learning techniques, deep learning has become popular both as a prediction method and as a feature extraction method. Features extracted using deep learning methods are often referred to as embeddings. These embeddings can capture deeper latent features from different perspectives, such as computer vision, complex networks, and natural language, to assist downstream machine learning methods in determining potential relationships.
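As an illustration of what a "simple sequence-based statistical feature" can look like, the following minimal sketch computes normalized k-mer frequencies for an RNA sequence. The choice of k-mer frequencies, the function name `kmer_frequencies`, and its parameters are assumptions made for this example; the abstract does not specify which statistical features the traditional methods use.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGU"):
    """Return the relative frequency of every k-mer over `alphabet`, in a fixed order.

    Hypothetical helper for illustration only; not taken from the thesis.
    """
    # Normalize case and map DNA-style T to RNA-style U.
    seq = sequence.upper().replace("T", "U")
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all possible k-mers so the feature vector has a fixed length.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Windows containing symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


features = kmer_frequencies("AUGCAU", k=2)  # 16-dimensional vector for k=2
```

A fixed-length vector of this kind can be fed to a classical classifier, whereas a learned embedding replaces the hand-crafted counts with features produced by a trained network.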
Currently, machine learning methods based on embeddings have achieved excellent prediction performance in many cross-disciplinary prediction tasks. In addition to using deep learning embeddings to improve feature extraction, another strategy for improving prediction performance is ensemble learning, which involves constructing several sub-models (base models) and combining their outputs through voting to obtain the final prediction. Ensemble learning methods have been shown to improve prediction performance across many tasks.
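The voting step of ensemble learning can be sketched as follows. This is a minimal example assuming hard (majority) voting over class labels; the abstract does not state whether the proposed frameworks use hard or soft voting, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample by majority vote.

    `predictions` holds one list of labels per base model, all of equal length.
    Hypothetical helper for illustration only; not taken from the thesis.
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every base model must predict the same number of samples")
    final = []
    # zip(*predictions) groups the votes that all models cast for one sample.
    for votes in zip(*predictions):
        # Pick the most frequent label; using an odd number of models avoids ties
        # in the binary case.
        final.append(Counter(votes).most_common(1)[0][0])
    return final


# Three base models voting on three candidate RNA-target pairs (1 = interaction).
combined = majority_vote([[1, 0, 0], [1, 1, 0], [0, 1, 0]])
```

Each inner list stands in for the output of one trained base model, so the same function applies regardless of whether the base models are neural networks or tree-based classifiers.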
Therefore, by combining deep learning embedding information and ensemble learning methods, we proposed three frameworks for predicting microRNA, lncRNA, and viral RNA targets. We constructed the SRG-Vote algorithm for predicting microRNA target genes based on sequence and network embeddings, the lncRNA-top algorithm for predicting lncRNA target genes based on varied sequence feature embeddings, and the deepseq2drug platform for predicting virus-targeting drugs based on sequence feature embeddings, computer vision-based embeddings, network-based embeddings, and natural language-based embeddings. In SRG-Vote, LSTM or bi-LSTM models served as the base models for ensemble learning over miRNAs and their target genes. In lncRNA-top, CNN and random forest (RF) models were the base models for predicting lncRNAs and their targets. For virus-drug relationship prediction, since deep learning models were already used in the feature and embedding extraction process, only RF was used as the base model for ensemble learning in the final prediction stage. Of the three, lncRNA-top and deepseq2drug provide executable programs and websites for convenient use, available at http://lncrna.cs.cityu.edu.hk/ and http://deepseq2drug.cs.cityu.edu.hk/, respectively.
| Date of Award | 9 Sept 2024 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Ka Chun WONG (Supervisor) |