Deep Learning Models for Genomic Sequences: CRISPR/Cas9 Off-Targets and DNA Motifs


Student thesis: Doctoral Thesis





Award date: 10 Aug 2021


Deep learning has driven performance breakthroughs in genomic analysis since 2015. In parallel, new biotechnologies such as CRISPR gene editing and new deep learning models such as the Transformer keep emerging. There is an urgent demand to apply state-of-the-art deep learning models to decipher biological processes with these emerging biotechnologies. In this thesis, I develop deep learning models for two cutting-edge genomic-sequence-based tasks: CRISPR off-target prediction and heterodimeric DNA motif synthesis. The shared theme of these models is to learn how pairs of genomic sequences encode functional activity and then to leverage that knowledge to predict vital biological processes (i.e., CRISPR off-target activity and transcription factor binding cooperativity).

The CRISPR-Cas9 system is a groundbreaking tool for gene editing in various species and cell types. Cas9 is an RNA-guided endonuclease that cleaves double-stranded DNA upstream of a 3-nucleotide protospacer adjacent motif (PAM) at sites bearing sequences complementary to a 20-nucleotide segment of the guide RNA (gRNA). Its adaptability and specificity give the tool great potential to manipulate the genome in a targeted manner, driving both basic science research and the development of medical therapeutics. Although the CRISPR gene-editing system targets specific fragments of DNA, single-base-pair mismatches are often tolerated by the gRNA, so the Cas9 complex can bind to unintended regions (off-targets) and cleave at those sites as well. Therefore, understanding whether CRISPR induces off-target mutations in human genome editing is an essential aspect of risk assessment before developing clinical therapeutics. However, current laboratory-based assays such as SITE-Seq, GUIDE-seq, and CIRCLE-seq are still time-consuming and expensive. These assays are not practically feasible for detecting the off-target sites of the vast number of candidate guide RNAs considered in genome-editing experiments. It is therefore necessary to use off-target prediction methods to prioritize candidate gRNAs. To fill this gap, in this thesis we develop three deep learning models that predict off-target activity by modeling pairs of a guide RNA and its putative target DNA sequence. We demonstrate that our proposed deep learning models achieve competitive performance on CIRCLE-seq and GUIDE-seq datasets with indels and mismatches, and outperform the state-of-the-art off-target prediction methods on two independent mismatch-only datasets. Our aggregate model also surpasses a competing method on the gRNA off-target aggregation task.
Moreover, we introduce a two-stage sensitivity analysis to visualize the prediction on the gRNA-target pair of interest, demonstrating how implicit knowledge encoded in the deep learning model contributes to accurate off-target activity quantification.
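
As a rough illustration of what a gRNA-target pair looks like as model input, the following minimal sketch checks the PAM, lists mismatch positions, and builds a simple per-position one-hot encoding of the pair. It is not the encoding or architecture used in the thesis. The "NGG" PAM pattern (the canonical SpCas9 PAM), the function names, the example sequences, and the 8-channel layout are all assumptions made for illustration, and indels are not handled.

```python
from typing import List

# Assumption: "NGG" is the canonical SpCas9 PAM; the text only says "3-nucleotide PAM".
ASSUMED_PAM = "NGG"
BASES = "ACGT"


def pam_matches(pam_site: str, pattern: str = ASSUMED_PAM) -> bool:
    """Return True if the site fits the pattern ('N' in the pattern matches any base)."""
    site = pam_site.upper()
    return len(site) == len(pattern) and all(
        p == "N" or p == s for p, s in zip(pattern, site)
    )


def _normalize(guide: str, protospacer: str):
    """Upper-case both sequences and write the RNA guide in the DNA alphabet."""
    guide = guide.upper().replace("U", "T")
    protospacer = protospacer.upper()
    if len(guide) != len(protospacer):
        raise ValueError("guide and protospacer must have equal length (no indels here)")
    return guide, protospacer


def mismatch_positions(guide: str, protospacer: str) -> List[int]:
    """0-based positions where the guide and the candidate protospacer differ."""
    guide, protospacer = _normalize(guide, protospacer)
    return [i for i, (g, t) in enumerate(zip(guide, protospacer)) if g != t]


def one_hot(base: str) -> List[int]:
    """One-hot vector over A, C, G, T; any other symbol becomes all zeros."""
    return [1 if base == b else 0 for b in BASES]


def encode_pair(guide: str, protospacer: str) -> List[List[int]]:
    """Hypothetical pair encoding: one row per position, 4 guide + 4 target channels."""
    guide, protospacer = _normalize(guide, protospacer)
    return [one_hot(g) + one_hot(t) for g, t in zip(guide, protospacer)]


if __name__ == "__main__":
    guide = "GAGTCCGAGCAGAAGAAGAA"          # hypothetical 20-nt guide
    site = "GAGTTCGAGCAGAAGAAGAA" + "TGG"   # hypothetical 20-nt protospacer + 3-nt PAM
    protospacer, pam = site[:20], site[20:]
    print("PAM ok:", pam_matches(pam))
    print("mismatches at:", mismatch_positions(guide, protospacer))
    print("encoded shape:", len(encode_pair(guide, protospacer)), "x 8")
```

A matrix like this could be fed to a sequence model in place of hand-crafted mismatch features. Supporting indels, as the datasets mentioned above require, would need an extra gap symbol and an alignment step that this sketch omits.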

DNA motifs (i.e., transcription factor binding sites) are prevalent and vital for gene regulation in different tissues at different developmental stages of eukaryotes. Although considerable efforts have been made to elucidate monomeric DNA motif patterns, our knowledge of heterodimeric DNA motifs is still far from complete. The current high-throughput assay, CAP-SELEX, has identified over 600 composite DNA sites (i.e., heterodimeric motifs) bound by cooperative transcription factor (TF) pairs. However, there are over 25,000 inferentially effective heterodimeric TFs in human cells. Therefore, we introduce DeepMotifSyn, a deep-learning-based computational tool for synthesizing heterodimeric motifs from monomeric motif pairs. Specifically, DeepMotifSyn is composed of a heterodimeric motif generator and an evaluator. The generator is a U-Net-based neural network that synthesizes heterodimeric motifs from aligned motif pairs. The evaluator is a machine-learning-based model that scores the generated heterodimeric motif candidates based on their motif sequence features. Systematic evaluations on CAP-SELEX data show that DeepMotifSyn significantly outperforms the current state-of-the-art predictors. In addition, DeepMotifSyn can synthesize multiple heterodimeric motifs with different orientation and spacing settings, which addresses a shortcoming of previous models. We believe DeepMotifSyn is a more practical and reliable model than current predictors for heterodimeric motif synthesis.
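
To make "aligned motif pairs" with different orientation and spacing settings concrete, here is a minimal sketch that places two monomeric motifs, represented as position probability matrices, side by side. It is not DeepMotifSyn's generator. It only shows one plausible way to build the kind of aligned pair a generator might take as input. The matrix layout (rows are positions, columns are A, C, G, T), the function names, the uniform background in gaps, and the averaging of overlapping columns are all assumptions.

```python
from typing import List

Pwm = List[List[float]]  # one row per position; columns ordered A, C, G, T

UNIFORM = [0.25, 0.25, 0.25, 0.25]  # assumed background for gap columns


def reverse_complement(pwm: Pwm) -> Pwm:
    """Reverse the positions and swap A<->T, C<->G by reversing each row."""
    return [row[::-1] for row in pwm[::-1]]


def place_pair(left: Pwm, right: Pwm, spacing: int, flip_right: bool = False) -> Pwm:
    """Combine two motifs at a given spacing and orientation.

    spacing > 0 inserts that many uniform columns between the motifs,
    spacing == 0 concatenates them, and spacing < 0 overlaps them,
    averaging the overlapping columns (a naive choice for illustration).
    """
    if flip_right:
        right = reverse_complement(right)
    start = len(left) + spacing  # column where the right motif begins
    if start < 0:
        raise ValueError("spacing places the right motif before the left one")
    length = max(len(left), start + len(right))
    combined = []
    for pos in range(length):
        rows = []
        if pos < len(left):
            rows.append(left[pos])
        if 0 <= pos - start < len(right):
            rows.append(right[pos - start])
        if not rows:  # gap column between the two motifs
            rows.append(UNIFORM)
        combined.append([sum(col) / len(rows) for col in zip(*rows)])
    return combined


if __name__ == "__main__":
    # Hypothetical toy motifs: "AC" and "G" with no uncertainty.
    motif_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    motif_b = [[0.0, 0.0, 1.0, 0.0]]
    for spacing in (-1, 0, 1):
        for flip in (False, True):
            candidate = place_pair(motif_a, motif_b, spacing, flip_right=flip)
            print(spacing, flip, candidate)
```

Enumerating spacings and orientations this way yields several candidate layouts per TF pair, which mirrors the generate-then-score split between the generator and the evaluator described above.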