New Methods for Prioritizing Disease Causal Genes from Omics Data


Student thesis: Doctoral Thesis



  • Lu ZHANG



Award date: 5 Jan 2016


With the general availability of next-generation sequencing techniques over the last decade, many projects have been initiated to identify disease-associated genes for human Mendelian and complex diseases. However, identifying these genes remains difficult because of three problems. First, the very large volume of raw data overwhelms available computational capacity, and analyzing the data and extracting valuable information from them requires tremendous effort. Many software tools have been developed, but true mutations may still be missed when relying on any single tool. Second, the relationships between gene functions and phenotypes are still unclear. Known disease genes explain only a small fraction of disease heritability, so new pathogenic mechanisms may be defined not only by new genes but also by their interactions. Uncovering these is difficult when relying only on currently available resources. Third, gene-gene interactions can be evaluated from gene expression or protein abundance, but most projects perform only genome sequencing, without matched RNA-seq or proteomic data, which prevents the construction of informative gene-gene interaction networks.
To address these problems, I develop several novel approaches for analyzing genomic and transcriptomic data from next-generation sequencing, for predicting gene-gene interactions and their directions from gene expression data, and for inferring microRNA targets from trans-omics data. First, I construct a comprehensive toolkit for analyzing genomic and transcriptomic data from next-generation sequencing. For genomic data, we introduce VarCus, a tool that incorporates several state-of-the-art alignment and variant-calling programs and formalizes them into customized pipelines. Changing the software used for read alignment or variant calling can dramatically reduce the correlation between the results of different pipelines. VarCus uses a novel approach to integrate these pipelines through three strategies: hard integration, unsupervised integration and supervised integration. For transcriptomic data, we design CloudRseq, which includes two components, CloudmRseq and CloudmiRseq, for analyzing mRNA-seq and microRNA-seq data, respectively. CloudRseq computes essential information such as gene expression profiles, de novo transcripts, novel microRNAs and gene coexpression networks. The results are presented in highly accessible formats, including many useful summaries and visualizations. Next, we analyze large-scale gene expression data from publicly available databases and develop two novel methods to accurately infer gene-gene interactions from integrated gene expression data. A multimodal framework (MMF) is designed to handle gene expression data with large sample sizes by assuming that the expression of each gene follows a Gaussian mixture model rather than a single normal distribution. MMF is further implemented as Multimodal Mutual Information and Multimodal Direct Information.
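To illustrate the modelling assumption behind MMF, the following is a minimal sketch and not the thesis implementation. It fits both a single Gaussian and a two-component Gaussian mixture to one gene's expression values and compares them with the Bayesian information criterion (BIC). The function names, the restriction to two components, the median-split initialization and the use of BIC are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_single_gaussian(values, min_var=1e-6):
    """Maximum-likelihood fit of one Gaussian; returns (mean, var, loglik)."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, min_var)
    loglik = sum(math.log(normal_pdf(v, mean, var)) for v in values)
    return mean, var, loglik


def fit_two_component_gmm(values, n_iter=200, min_var=1e-6):
    """EM fit of a two-component 1-D mixture (hypothetical helper).

    Returns (weights, means, variances, loglik).
    """
    # Assumed initialization: split the sorted values at the median.
    ordered = sorted(values)
    half = len(ordered) // 2
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    _, overall_var, _ = fit_single_gaussian(values, min_var)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            p = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            variances[k] = max(
                sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk,
                min_var,
            )

    loglik = sum(
        math.log(sum(weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)))
        for v in values
    )
    return weights, means, variances, loglik


def bic(loglik, n_params, n_samples):
    """Bayesian information criterion; lower values indicate a better trade-off."""
    return n_params * math.log(n_samples) - 2.0 * loglik


# Synthetic bimodal "expression" values, invented purely for this example.
random.seed(0)
expression = ([random.gauss(2.0, 0.5) for _ in range(200)]
              + [random.gauss(8.0, 0.5) for _ in range(200)])

_, _, ll_single = fit_single_gaussian(expression)
_, _, _, ll_mixture = fit_two_component_gmm(expression)
print("BIC single Gaussian:", bic(ll_single, 2, len(expression)))
print("BIC two-component mixture:", bic(ll_mixture, 5, len(expression)))
```

A single Gaussian has two free parameters (mean and variance), while the two-component mixture has five (two means, two variances and one free weight), which is why the two calls to `bic` use different parameter counts.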
Results show that MMF significantly outperforms other methods in discovering gene-gene interactions, with or without transitive relations. Since regulatory relationships between genes are believed to be directed, we propose the context-based dependency network (CBDN) to (1) determine the direction of each relationship according to the influence function; (2) remove transitive interactions using a directed data processing inequality; and (3) identify the important regulators. Beyond gene expression data, we develop MicroTrans, a tool that combines mRNA sequencing, microRNA sequencing and degradome sequencing data to provide additional information for predicting microRNA target genes. MicroTrans is tested on pepper; the results suggest that trans-omics data can predict microRNA-mRNA interactions more accurately. I believe our methods and findings can bring us one step closer to precision medicine and will find wide application when whole-genome sequencing becomes ubiquitous.
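The removal of transitive interactions mentioned above can be illustrated with a minimal sketch. It uses the classic undirected data processing inequality, not the directed variant proposed for CBDN, whose details are not given here. The function name `prune_with_dpi`, the `tolerance` parameter and the example scores are assumptions made for illustration. In every fully connected triplet of genes, the weakest edge is treated as an indirect interaction and removed.

```python
from itertools import combinations


def prune_with_dpi(mi, tolerance=0.0):
    """Remove the weakest edge of every fully connected gene triplet.

    mi maps frozenset({gene_a, gene_b}) to a mutual information estimate.
    Edges are only marked during the scan and removed at the end, so the
    result does not depend on the order in which triplets are visited.
    """
    genes = sorted({g for pair in mi for g in pair})
    to_remove = set()
    for a, b, c in combinations(genes, 3):
        edges = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if not all(e in mi for e in edges):
            continue  # the inequality is only checked on closed triangles
        weakest = min(edges, key=lambda e: mi[e])
        strongest_two = [e for e in edges if e != weakest]
        # Strict comparison: ties are kept rather than pruned.
        if mi[weakest] < min(mi[e] for e in strongest_two) * (1.0 - tolerance):
            to_remove.add(weakest)
    return {e: w for e, w in mi.items() if e not in to_remove}


# Invented scores: A-C is weaker than both A-B and B-C, so it is treated
# as a transitive interaction that passes through B.
scores = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.8,
    frozenset(("A", "C")): 0.3,
}
for edge, weight in sorted(prune_with_dpi(scores).items(), key=lambda kv: -kv[1]):
    print(sorted(edge), weight)
```

A non-zero `tolerance` makes the pruning more conservative by keeping a weakest edge whose score is close to the other two.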