Abstract
Eukaryotic genomes are intricately folded within three-dimensional (3D) space to facilitate complex regulatory functions. Recent advancements in high-throughput chromosome conformation capture (Hi-C) technologies have driven substantial progress in studying the patterns of 3D chromatin organization and their functional implications. Studies utilizing Hi-C technologies have uncovered a complex, multi-layered organization of the 3D genome. This hierarchical structure spans various length scales, featuring key architectural elements including chromatin loops, topologically associating domains (TADs), and compartments. These architectural features are ubiquitous across the genomes of different species and play critical roles in various biological processes. Detecting these features in the genome can provide valuable insights into the roles the 3D genome plays in gene regulation. In this thesis, we aim to tackle the task of identifying architectural features in 3D genomics data by leveraging recent developments in deep learning. Our studies strategically exploit the multi-view nature of Hi-C data to effectively handle varying sequencing depths and data types, thereby providing reliable tools for downstream structural and functional analyses.
The first piece of work described in this thesis focuses on loop detection in bulk Hi-C data. Bulk Hi-C involves applying Hi-C technology to a considerable number of cells, allowing for the capture of average chromatin interaction profiles within a cell population, such as a specific cell type or cell line. Although numerous enrichment-based detection methods have been developed, these methods typically suffer from performance deterioration when sequencing coverage is insufficient. To address this issue, we propose a deep learning model that can robustly learn to identify CTCF-mediated loops across different sequencing depths. We demonstrate that our model outperforms the state-of-the-art machine learning-based method trained on the same data. In this work, we also propose that Hi-C data are multi-view in nature, which provides insights into modeling techniques for Hi-C contact maps. This is exemplified by the dual-branch architecture of our neural network, which learns from both the image view and the graph view of the contact maps. With the enriched knowledge supplied by both views, the network exhibits robustness across multiple sequencing depths.
The second piece of work turns to single-cell Hi-C (scHi-C), which is performed at the level of individual cells to capture the cell-to-cell variability of the 3D genome. Despite recent advances, scHi-C data are extremely sparse, posing significant challenges for the development of computational tools. Due to this sparsity, it is practically infeasible to apply traditional enrichment-based methods to individual single-cell contact maps to detect loops. A compromise method for scHi-C loop calling involves imputing the contact maps and then conducting statistical tests across multiple cells for each potential entry. This approach allows the identification of loops at the cell population level, but at the cost of losing single-cell information in the final loop annotations. Furthermore, this method is computationally expensive. To address these issues, we develop a lightweight neural network for loop calling at the single-cell level. The multi-view nature of Hi-C data is further investigated and leveraged in this work. Considering the significant collapse of the image view in scHi-C data due to sparsity, our modeling relies on the graph view and the sequence view of the data. The proposed model not only outperforms the existing method at the cell-type level, but also successfully captures the consistency and heterogeneity of loops across single cells. We also carry out a study on inferring multi-connected hubs from single-cell loops to showcase the application of these loops in downstream analysis.
In the third piece of work, we build upon the framework for identifying chromatin loops within single cells to develop a method that detects higher-order chromatin structures, including single-cell TAD-like domains (TLDs) and A/B compartments. The model utilizes multi-task learning techniques, enabling it to simultaneously refine single-cell loop predictions and capture comprehensive information about the 3D architecture of the genome. Harnessing the learned representations of single-cell genomic loci, we adopt sequence-based unsupervised learning approaches to identify TLDs and A/B compartments within individual cells. This model provides a unified framework for the integrated analysis of multi-scale architectural features in an imputation-free manner. Leveraging the single-cell architectural features detected by our model, we can effectively characterize their heterogeneity across diverse cell types. Moreover, we analyze the predictive power of different architectural feature sets for cell type classification. This analysis also leads to a novel method for pinpointing chromatin loop anchors that are critical for cell identity determination.
| Date of Award | 22 Oct 2024 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Ka Chun WONG (Supervisor) |