Methods and Applications for Omics-abundance Analysis
組學豐度分析的方法和應用
Student thesis: Doctoral Thesis
Detail(s)
Award date | 20 Jun 2019 |
---|---|
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(72779beb-2ead-435b-97c2-713434430901).html |
---|---|
Abstract
Omics-abundance refers to the abundance tables generated in various biological omics analyses. "Omics" refers to fields of study in biology whose names end in -omics, including transcriptomics, metagenomics, and proteomics. Although the analyses of gene expression in transcriptomics, of microbial abundance, and of proteomics data are currently different research areas, we claim that they are similar at the level of abundance analysis and that it is feasible to integrate them by borrowing ideas from one another. We designed a framework based on a network structure to link multiple data analysis algorithms together. We integrated widely used software and constructed a pipeline. While investigating existing software, we found several limitations in current methods and proposed improvements in three aspects: distance, network comparison, and key driver detection.
First, the absolute correlation distance d_a = 1 − |ρ| is widely used to evaluate the differences between gene expression profiles, where ρ is a correlation measure such as the Pearson or Spearman correlation. This function outputs a low value if the profiles are strongly correlated, whether negatively or positively, and a high value otherwise. However, the absolute correlation distance fails to satisfy the triangle inequality, a property that would guarantee better performance in vector quantization, allow fast data localization, and speed up data clustering. We propose d_r = sqrt(1 − |ρ|) as an alternative and prove that it satisfies the triangle inequality. d_r and d_a have comparable performance in clustering experiments on real biological data, while d_r is more robust.
Second, to compare two networks, researchers usually measure the similarity of corresponding nodes. Node similarity is often defined as the similarity of the nodes' neighbors. Neighbor similarity calculations often involve setting a threshold, which is ad hoc. This motivated us to propose an eigenvector-based method that measures node conservation without threshold setting. We first align the nodes of the input networks by weighted bipartite matching and form an aligned network. We assign each edge of the aligned network a mapping score, which is calculated from the corresponding edges in the two input networks. We then calculate the eigenvector of the mapping score matrix, use it as the node conservation score, and detect the conserved network. We implemented the proposed method and benchmarked it against ComPIEx. Our method is more stable across different noise levels. By applying our method, we identified the conserved and divergent parts of the molecular ecological networks of the oral microbiome in individuals with and without rheumatoid arthritis (RA) and discovered that abnormal reactive oxygen levels and a defective DNA mismatch repair system are associated with RA.
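The triangle-inequality claim in the first contribution can be checked numerically. The following is a minimal sketch, not the thesis's proof or code: it assumes ρ is the Pearson correlation, and the function names and the three toy profiles are hypothetical, chosen so that x and z are uncorrelated while y correlates equally with both.

```python
import numpy as np


def abs_corr_distance(x, y):
    """Absolute correlation distance 1 - |rho|, with rho the Pearson correlation."""
    rho = np.corrcoef(x, y)[0, 1]
    return 1.0 - abs(rho)


def sqrt_abs_corr_distance(x, y):
    """Square-root variant sqrt(1 - |rho|); clamp guards against float rounding."""
    return np.sqrt(max(abs_corr_distance(x, y), 0.0))


def triangle_gap(dist, x, y, z):
    """dist(x, y) + dist(y, z) - dist(x, z); a negative value is a violation."""
    return dist(x, y) + dist(y, z) - dist(x, z)


# Hypothetical toy profiles: x and z are uncorrelated, y = x + z lies "between" them.
x = np.array([1.0, -1.0, 1.0, -1.0])
z = np.array([1.0, 1.0, -1.0, -1.0])
y = x + z

gap_a = triangle_gap(abs_corr_distance, x, y, z)
gap_r = triangle_gap(sqrt_abs_corr_distance, x, y, z)
print(f"1 - |rho|       gap: {gap_a:.4f}")
print(f"sqrt(1 - |rho|) gap: {gap_r:.4f}")

# Spot check of the square-root variant on random triples (not a proof).
rng = np.random.default_rng(0)
min_gap_r = min(
    triangle_gap(sqrt_abs_corr_distance, *rng.normal(size=(3, 10)))
    for _ in range(1000)
)
print(f"smallest random gap for sqrt(1 - |rho|): {min_gap_r:.4f}")
```

On this triple the gap for 1 − |ρ| is negative, which is a violation, while the gap for the square-root variant stays non-negative. The random spot check only illustrates the property; it does not replace a proof.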
Third, we propose a novel method to detect the key drivers of disease-related subnetworks. We apply the method to microbial abundance data. By investigating the abundance associations of species or genes, we construct molecular ecological networks (MENs). We first partition the MEN into subnetworks. We then identify the subnetworks most pertinent to the disease by measuring the correlation between the abundance pattern and the delegated phenotype, that is, the variable representing the disease phenotypes. Last, for each identified subnetwork, we detect the key driver by PageRank. We detected subnetworks related to RA (a disease caused by a compromised immune system), which include InterPro matches (IPRs) concerned with immunoglobulin, sporulation, biofilm, flaviviruses, bacteriophage, etc. Notably, the development of biofilms is regarded as one of the drivers of persistent infections.
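The last step above, ranking the nodes of one subnetwork by PageRank, can be sketched as follows. This is an illustrative power-iteration implementation under stated assumptions, not the thesis's code: the node names, the edge list, the unweighted undirected edges, and the damping factor of 0.85 are all hypothetical.

```python
import numpy as np


def pagerank(adj, damping=0.85, tol=1e-10, max_iter=1000):
    """PageRank by power iteration on a (possibly weighted) adjacency matrix."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    # Row-normalise; nodes with no edges spread their rank uniformly.
    safe_out = np.where(out > 0, out, 1.0)
    trans = np.where(out > 0, adj / safe_out, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = (1.0 - damping) / n + damping * (trans.T @ rank)
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical subnetwork: node names and edges are made up for illustration.
nodes = ["ipr_hub", "ipr_b", "ipr_c", "ipr_d", "ipr_e"]
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]

adj = np.zeros((len(nodes), len(nodes)))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0  # undirected association

ranks = pagerank(adj)
key_driver = nodes[int(np.argmax(ranks))]
for name, score in sorted(zip(nodes, ranks), key=lambda p: -p[1]):
    print(f"{name}: {score:.4f}")
print("candidate key driver:", key_driver)
```

In this toy subnetwork the most connected node receives the highest score. With real data the edges would come from the abundance associations, and edge weights could replace the 0/1 entries.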
Besides the proposed methods, we also present experiments and findings from two data-driven analyses. By analyzing transcriptomics abundance (RNA-seq) data from pigs, we observed asynchronous development between two subspecies via time-series analysis. By analyzing microbial abundance (16S rRNA) data collected from 2058 individuals, including obese and normal-weight individuals, we investigated the association between microbial abundance and phenotype information and observed two types of obesity. We believe our methods can shed light on omics-abundance analysis and that the experiments on the diseases involved can provide hints toward understanding their biology.
- Omics-abundance, transcriptomics, metagenomics, network analysis, time-series analysis, distance, triangular inequality