DNA Motif Knowledge Extraction and Distillation from Big Deep Learning Models in Regulatory Genomics

Project: Research



Deep learning has achieved wide success. Unfortunately, its black-box nature has prevented its confident adoption in many genomic settings. In particular, the complexity of deep neural network models (e.g. DeepSEA, Sei, and Enformer) has grown by several orders of magnitude (e.g. from KBs to GBs), and this growth in model size further complicates their confident use in regulatory genomics. In this proposal, we aim to address this issue by developing knowledge extraction and distillation methods for big deep learning models in regulatory genomics, with backup paths as outlined in Figure 1.

In Objective 1, we aim to develop knowledge extraction methods for deep learning models in regulatory genomics. Specifically, different knowledge extraction paradigms will be explored in the context of regulatory genomics. To assess feasibility, in preliminary experiments we extracted the first convolution layers of Sei and Enformer and compared their convolution filters with the existing DNA motif knowledge in Jolma, JASPAR, and UniPROBE. Interestingly, strong overlaps are observed even from a single deep learning model. In addition, unknown DNA motif patterns are identified. Based on these preliminary results, three further research directions are proposed.

In Objective 2, we aim to develop knowledge distillation methods for deep learning models in regulatory genomics. Specifically, different teacher-student training approaches will be examined. Neural architecture search will also be performed to explore possible student models for the best performance. To assess feasibility, we constructed a student model based on max-pooling over the previously extracted convolution layer of Sei and demonstrated that the resulting model can already outperform Sei, which was published in Nature Genetics in 2022, in terms of AUPRC. Interestingly, we found that the performance gain was attributable to a misunderstanding we had made in PyTorch, which led to a number of novel insights and multiple possible tasks in this proposal.

In Objective 3, as an essential task for scientific reproducibility and impact, we plan to release the developed models as open-source software and a public web service. In particular, we have conceived several directions in the proposal. If the financial and time budgets are sufficient, we would also collaborate with wet-lab scientists on genomic verification of the novel genomic patterns that arise (e.g. the unknown DNA motifs in the Objective 1 preliminary results), although separate funding may be needed for this.
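
The filter-to-motif comparison mentioned under Objective 1 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual pipeline: the proposal does not say how filters are converted to motif matrices or which similarity measure is used, so the column-wise softmax, the Pearson correlation score, and all function names here (`filter_to_ppm`, `best_motif_match`, `consensus_to_ppm`) are hypothetical choices. The toy filter and reference motif are made up for the example and do not come from Sei, Enformer, or any motif database.

```python
import math

BASES = "ACGT"

def filter_to_ppm(weights):
    """Column-wise softmax: weights[b][j] -> probability of base b at position j.

    Assumption: softmax is only one plausible conversion; the proposal does
    not specify how convolution filters are turned into motif matrices.
    """
    width = len(weights[0])
    ppm = [[0.0] * width for _ in BASES]
    for j in range(width):
        col = [weights[b][j] for b in range(4)]
        top = max(col)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in col]
        total = sum(exps)
        for b in range(4):
            ppm[b][j] = exps[b] / total
    return ppm

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def best_motif_match(ppm, motif):
    """Slide motif (4 x m) along ppm (4 x k, m <= k); return (best r, offset)."""
    k, m = len(ppm[0]), len(motif[0])
    flat_motif = [motif[b][j] for b in range(4) for j in range(m)]
    best_r, best_offset = -1.0, 0
    for offset in range(k - m + 1):
        window = [ppm[b][offset + j] for b in range(4) for j in range(m)]
        r = pearson(window, flat_motif)
        if r > best_r:
            best_r, best_offset = r, offset
    return best_r, best_offset

def consensus_to_ppm(consensus, p=0.85):
    """Toy reference motif: probability p on the consensus base, rest shared equally."""
    rest = (1.0 - p) / 3.0
    return [[p if base == c else rest for c in consensus] for base in BASES]

# Toy example: hide a made-up 7-bp pattern at offset 2 of an 11-wide filter.
motif = consensus_to_ppm("TGACTCA")
weights = [[0.0] * 11 for _ in BASES]  # flat, uninformative flanks
for b in range(4):
    for j in range(7):
        weights[b][2 + j] = math.log(motif[b][j])

r, offset = best_motif_match(filter_to_ppm(weights), motif)
print(f"best Pearson r = {r:.3f} at offset {offset}")
```

In practice the reference matrices would come from a motif collection and the filter weights from the trained model's first layer. A dedicated motif comparison tool would also handle reverse complements and statistical significance, which this sketch omits.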


Project number: 9043513
Grant type: GRF
Status: Not started
Effective start/end date: 1/01/24 → …