Dual Development of Deleterious Prediction Models for DNA-binding Specificities of Human Transcription Factors on Both Sides: DNA Binding Regions versus Protein Coding Regions

Project: Research

Description

In humans, the protein-DNA binding interactions between transcription factors (TFs) and transcription factor binding sites (TFBSs) are the fundamental steps in controlling upstream gene expression. Determining the binding specificities forms the basis for downstream gene regulation analysis. Although existing deleterious variant predictions have been well studied on the observed single nucleotide polymorphism (SNP) variations related to TFs and TFBSs, their performance on unobserved variations remains speculative. Therefore, we aim to develop novel deleterious variant prediction models for the DNA-binding specificities of human TFs on both sides (DNA-binding regions versus protein-coding regions). Initial testing results are given for each project phase.

On the DNA-binding regions, we propose to develop the first in silico k-spectrum recognition model for encoding the DNA-binding specificities of human transcription factors. The k-spectrum model not only can predict TFBS sequence patterns from the protein sequences of TFs, but can also achieve k-spectrum resolution, so that local sequence context can be taken into account for deleterious variant predictions of observed and unobserved binding regulatory SNPs (rSNPs) on TFBSs bound by human TFs. A mathematical example with complexity analysis is given in the main text. (Phase A; Objective 1)

On the protein-coding regions, we propose to develop the first deleterious SNP prediction model tailor-made for the non-synonymous SNPs (nsSNPs) of human TFs. This is important because past studies usually focus on fitting one model to all human proteins, whereas our model is tailor-made for human TFs, whose DNA-binding characteristics are substantially different from those of other protein families. Predictions from our initial testing model (TFmedic) are found to be more accurate than those of Harvard PolyPhen-2 and JCVI SIFT, as shown in Figure 4. (Phase B; Objective 2)

Given the developed models on both sides from Phases A and B, we aim to integrate the models to identify and prioritize the genome-wide co-varying eQTL pairs between binding rSNPs on TFBSs and nsSNPs on TFs' coding regions from existing eQTL studies, in the context of the DNA-binding specificities of human TFs. (Phase C; Objective 3)

At the end, the developed models and resultant data will be tested and released as open-source software and public datasets, respectively, for scientific reproducibility. The PI's past PhD supervisor from Toronto, Zhaolei Zhang, has agreed to provide biochemical cross-validations for TF-TFBS interactions; his support letter is attached. (Phase C; Objective 4)

For illustration, the whole research plan is outlined in Figure 1.
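To make the idea of k-spectrum resolution and local sequence context concrete, the sketch below counts the overlapping k-mers of a DNA sequence and reports which k-mers are lost and gained when a single nucleotide is substituted. It is only an illustration of a generic k-mer spectrum, not the project's recognition model; the function names `k_spectrum` and `snp_spectrum_shift`, the example sequence, and the choice of k = 3 are all assumptions made for this example.

```python
from collections import Counter


def k_spectrum(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq (a generic k-mer spectrum)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def snp_spectrum_shift(seq: str, pos: int, alt: str, k: int):
    """Return (lost, gained) k-mer counts when seq[pos] is replaced by alt.

    Hypothetical helper: a single substitution can only affect the k-mers
    that overlap position pos, i.e. its local sequence context.
    """
    mutated = seq[:pos] + alt + seq[pos + 1:]
    ref, mut = k_spectrum(seq, k), k_spectrum(mutated, k)
    # Counter subtraction keeps only the positive differences.
    return ref - mut, mut - ref


# Illustrative sequence (not a real TFBS): substitute T -> A at index 3.
lost, gained = snp_spectrum_shift("ACGTACGT", pos=3, alt="A", k=3)
print("lost:", dict(lost))
print("gained:", dict(gained))
```

A single substitution changes at most k overlapping k-mers, which is why a larger k captures a wider sequence context around each variant at the cost of a larger feature space (4^k possible k-mers).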

Detail(s)

Project number: 9042480
Grant type: GRF
Status: Finished
Effective start/end date: 1/12/17 to 24/11/21