Finding Topologically Associating Domains with Structural Information
Identifying Topologically Associating Domains Using Structural Entropy (利用結構熵識別拓撲關聯域)
Student thesis: Doctoral Thesis
Detail(s)
Award date | 5 Sep 2022 |
---|---|
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(80bde991-caee-4053-a8b3-8dae27f6ced7).html |
Abstract
Topologically associating domains (TADs) are now understood to be the organizational units of chromosome structure. Genomic sequences within a TAD interact with each other more intensely than with sequences in other TADs. A TAD can be further subdivided into sub-topologies (sub-TADs), and proximal TADs can aggregate into a higher-order structural domain (meta-TAD), which gives rise to a hierarchy of TADs. TADs are found to be conserved across cell types and species. TAD boundaries can obstruct the spread of transcriptional activity and harbor inhibiting factors such as CTCF binding sites, cohesin complexes, housekeeping gene TSSs, and SINE retrotransposons. Disruption of TADs may lead to the development of diseases such as cancer.
The design of ligated read sequencing makes it possible to study the whole-genome contact map of genomic regions at single-nucleosome resolution. Many computational methods for detecting the hierarchical structure of TADs focus on identifying locally dense domains or potential TAD boundaries, but fail to capture the long-range interactions between TADs. Graph-based methods embed the spatial proximity of the genome and interpret the contact map as a weighted undirected graph, providing a more detailed landscape of chromosome conformation. This thesis applies structural information theory to dissect the hierarchy of topologically associating domains. The studies in each chapter are described below:
Chapter 1: This chapter introduces the background of 3D genome research and existing computational methods for TAD detection. We discuss the differences across genome topologies, the advantages and disadvantages of various TAD detection algorithms, and the application of graph structural information theory to TAD detection tasks.
Chapter 2: This chapter introduces SuperTAD, a robust method for detecting hierarchical TADs that finds the hierarchical partitioning of bins with minimum structural information. We prove that finding an optimal coding tree from a DNA contact map is solvable in polynomial time, and we design an optimal dynamic programming algorithm that computes the coding tree with minimal structural information. Experiments on simulated and real data show that SuperTAD outperforms seven existing methods in accuracy and consistency.
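To make the idea of a minimum-structural-information partition concrete, the following is a minimal sketch, not the SuperTAD algorithm itself. It assumes the flat (two-level) case only: the contact map is a symmetric matrix with a zero diagonal, domains are contiguous runs of bins, and the score is the standard two-dimensional structural entropy of a graph partition. The function names `module_cost` and `best_contiguous_partition` and the toy matrix are hypothetical and chosen for illustration.

```python
import math

def module_cost(weights, degree, total, start, end):
    """Entropy contribution of the contiguous module of bins [start, end)."""
    volume = sum(degree[start:end])
    if volume == 0:
        return 0.0
    # With a zero diagonal, volume = (twice the inside weight) + cut weight.
    inside = sum(weights[i][j] for i in range(start, end) for j in range(start, end))
    cut = volume - inside
    cost = 0.0
    for i in range(start, end):
        if degree[i] > 0:
            cost -= degree[i] / total * math.log2(degree[i] / volume)
    if cut > 0:
        cost -= cut / total * math.log2(volume / total)
    return cost

def best_contiguous_partition(weights):
    """Dynamic programming over domain end points.

    The two-dimensional structural entropy is a sum of per-module terms,
    so best[end] = min over start of best[start] + cost(start, end).
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    total = sum(degree)
    best = [0.0] + [math.inf] * n   # best[j]: minimal entropy for bins [0, j)
    prev = [0] * (n + 1)            # prev[j]: start of the last domain ending at j
    for end in range(1, n + 1):
        for start in range(end):
            candidate = best[start] + module_cost(weights, degree, total, start, end)
            if candidate < best[end]:
                best[end] = candidate
                prev[end] = start
    domains = []
    end = n
    while end > 0:                  # backtrack from the last bin
        domains.append((prev[end], end))
        end = prev[end]
    domains.reverse()
    return best[n], domains

# Toy contact map with made-up values: two blocks of three bins each.
toy = [[0 if i == j else (10 if i // 3 == j // 3 else 1) for j in range(6)]
       for i in range(6)]
score, domains = best_contiguous_partition(toy)
```

On this toy map the sketch recovers the two planted blocks as half-open intervals of bin indices. A hierarchical coding tree of greater height, as described in the chapter, would need a richer recursion than this single-level one.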
Chapter 3: In this chapter, we collect RNA-DNA and RNA-RNA interaction sequencing data (RNA-associated interactions, RAIs) and find TAD-like domains (TLDs) by minimizing structural information. We propose SuperTLD, an imputation-based domain detection method. SuperTLD first imputes the missing interaction frequencies through a negative binomial model with a linear mean-variance dependency for genes. A Bayesian correction is then incorporated into the structural information theory to detect hierarchical domains from the imputed RAIs. The inferred TLDs share a moderate structural similarity with TADs, and the novel TLDs exhibit enriched CTCF at boundaries and enriched H3 histone marks within domains.
Chapter 4: We perform an exhaustive exploration of spectral clustering methods for identifying genomic state-transition sites. We systematically compare the clustering performance obtained from a single eigenvector (the Fiedler vector or another eigenvector), as well as the performance obtained from multiple eigenvectors of six forms of graph embedding mappings in different combinations. Experiments on simulated data and real Hi-C data show that the top eigenvector is not the only one able to reflect the feature. The state-transition sites identified from some eigenvectors correlate highly with CTCF-enriched bins, histone marks, and TAD boundaries.
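As an illustration of clustering with a single eigenvector, the sketch below computes the Fiedler vector of a contact map and reports the bins where its sign flips. It rests on several assumptions that are not taken from the thesis: the unnormalized Laplacian L = D - W is used (one of several possible graph embeddings), a sign change is treated as a candidate state-transition site, and the names `fiedler_vector` and `sign_change_sites` and the toy matrix are hypothetical.

```python
import numpy as np

def fiedler_vector(contact):
    """Eigenvector of the second-smallest eigenvalue of L = D - W."""
    contact = np.asarray(contact, dtype=float)
    laplacian = np.diag(contact.sum(axis=1)) - contact
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    _, eigenvectors = np.linalg.eigh(laplacian)
    return eigenvectors[:, 1]

def sign_change_sites(vector):
    """Indices of bins whose sign differs from the previous bin.

    Exact zeros in the vector would also register as changes; this
    sketch does not treat them specially.
    """
    signs = np.sign(vector)
    return [i + 1 for i in range(len(vector) - 1) if signs[i] != signs[i + 1]]

# Toy contact map with made-up values: two blocks of three bins each.
toy = np.ones((6, 6))
toy[:3, :3] = 10
toy[3:, 3:] = 10
np.fill_diagonal(toy, 0)

sites = sign_change_sites(fiedler_vector(toy))
```

The overall sign of an eigenvector is arbitrary, so only the positions of the sign changes are meaningful. The same `sign_change_sites` helper could be applied to other columns of the eigenvector matrix to explore eigenvectors beyond the Fiedler vector.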
Chapter 5: In this chapter, we propose SuperSC, which clusters cells from single-cell Hi-C (scHi-C) data based on the similarity of their contact maps and TAD-like domain structures (TLSs). SuperSC assumes that cells within a group tend to share more TAD-like domains than cells belonging to different groups. Structural entropy is adopted to measure how well a TLS represents the chromatin structure of a cell and to conduct cell clustering. SuperSC imputes the scHi-C data with a 3×3 average filter and the random walk with restart algorithm, and then clusters cells based on the contact map and the TLS embedded with the imputed contact maps. The simulated data experiment shows that SuperSC outperforms deTOKI in efficiency and in robustness to sparsity, noise, and contact. The single-cell Hi-C data experiment shows that SuperSC efficiently identifies clusters of cells and cluster-specific TLSs.
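The two imputation steps named above can be sketched as follows. This is a generic formulation under stated assumptions, not SuperSC's implementation: the filter uses zero padding at the matrix edges, the random walk uses a row-normalized transition matrix with the update Q ← (1 − p)·Q·B + p·I, and the restart probability, tolerance, function names, and toy matrix are all hypothetical choices.

```python
import numpy as np

def average_filter_3x3(matrix):
    """Replace each entry by the mean of its 3x3 neighborhood (zero padded)."""
    matrix = np.asarray(matrix, dtype=float)
    n_rows, n_cols = matrix.shape
    padded = np.pad(matrix, 1, mode="constant")
    smoothed = np.zeros_like(matrix)
    for row_shift in range(3):
        for col_shift in range(3):
            smoothed += padded[row_shift:row_shift + n_rows,
                               col_shift:col_shift + n_cols]
    return smoothed / 9.0

def random_walk_with_restart(matrix, restart=0.5, tol=1e-6, max_iter=100):
    """Iterate Q <- (1 - restart) * Q @ B + restart * I until Q stabilizes."""
    matrix = np.asarray(matrix, dtype=float)
    row_sums = matrix.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # leave empty rows as all zeros
    transition = matrix / row_sums         # row-normalized matrix B
    identity = np.eye(len(matrix))
    state = identity.copy()
    for _ in range(max_iter):
        updated = (1 - restart) * state @ transition + restart * identity
        if np.linalg.norm(updated - state) < tol:
            return updated
        state = updated
    return state

# Sparse toy contact map with made-up counts.
sparse = np.array([[0, 2, 0, 0, 0],
                   [2, 0, 0, 0, 0],
                   [0, 0, 0, 3, 0],
                   [0, 0, 3, 0, 1],
                   [0, 0, 0, 1, 0]])
smoothed = average_filter_3x3(sparse)
imputed = random_walk_with_restart(smoothed)
```

The filter spreads each observed contact to neighboring bins, and the random walk then propagates contact signal along indirect paths, which is why the imputed matrix is denser than the input. Each row of the result sums to one because the transition matrix is row-stochastic.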
Chapter 6: This chapter summarizes the thesis and offers perspectives on future work.