Abstract
The rapid evolution of high-throughput omics technologies has revolutionized our understanding of cellular systems by generating vast and complex biological datasets. However, these advances also present significant challenges in data analysis, including the need for robust clustering methods, accurate haplotype reconstruction in genomically altered tissues, and the clear visualization of multidimensional biological flows. This thesis addresses these challenges by developing three novel computational frameworks that extend current methodologies and offer enhanced interpretability and performance in biological data analysis.
In Chapter 1, a comprehensive introduction outlines the current challenges in omics data analysis, with a focus on the limitations of traditional clustering, haplotyping, and visualization methods in capturing the complexity of biological systems.
In Chapter 2, we introduce CeiTEA, an adaptive hierarchical clustering algorithm tailored for single-cell RNA sequencing data. Unlike traditional clustering methods that force binary or balanced tree structures, CeiTEA employs a newly proposed topological entropy (TE) measure alongside eigen-decomposition and integer linear programming to construct an unconstrained, multi-nary hierarchical tree. This approach captures both the vertical (depth) and horizontal (breadth) diversity of cell types and subtypes. Extensive evaluations on simulated and real-world datasets demonstrate that CeiTEA outperforms state-of-the-art clustering tools by recovering true cell population structures with high accuracy and lower entropy, ultimately delivering enhanced biological interpretability.
In Chapter 3, we present CNAHap, a novel germline haplotyping method specifically designed for tumor genomes exhibiting allele-specific copy number alterations (CNAs). CNAHap leverages the inherent imbalance in allele depths caused by CNAs to phase heterozygous single-nucleotide variants (SNVs) and small insertions and deletions, overcoming the limitations imposed by short-read sequencing. By formulating an integer programming model that integrates tumor purity, allele depth measurements, and segment-based copy number estimates, CNAHap reconstructs extended haplotype blocks even in challenging genomic regions. Validation on both in silico datasets and a hepatocellular carcinoma cohort confirms that CNAHap significantly improves phasing rates and block lengths, demonstrating its potential for clinical genomics and the study of tumor evolution.
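The intuition behind depth-based phasing can be illustrated with a minimal sketch. This is not CNAHap's integer programming model; it is a simplified per-site rule under stated assumptions: normal cells are diploid and heterozygous at every site, each segment has a known major and minor copy number, and the function names (`expected_major_fraction`, `phase_by_depth`) and the `min_ratio` threshold are hypothetical choices made for illustration.

```python
from typing import List, Tuple


def expected_major_fraction(purity: float, major_cn: int, minor_cn: int) -> float:
    """Expected read fraction of the allele on the higher-copy haplotype.

    Assumes normal cells carry one copy of each allele (diploid, heterozygous).
    """
    major = purity * major_cn + (1.0 - purity) * 1.0
    total = purity * (major_cn + minor_cn) + (1.0 - purity) * 2.0
    return major / total


def phase_by_depth(depths: List[Tuple[int, int]], min_ratio: float = 1.2) -> List[str]:
    """Label each heterozygous site in one imbalanced segment.

    depths holds (ref_depth, alt_depth) per site. A site is labelled 'ref' or
    'alt' according to which allele has the higher depth, i.e. is inferred to
    sit on the major haplotype. Sites whose depths are too similar (ratio below
    the hypothetical min_ratio) or that have no coverage are labelled '?'.
    """
    labels = []
    for ref, alt in depths:
        hi, lo = max(ref, alt), min(ref, alt)
        if hi == 0 or hi < min_ratio * lo:
            labels.append("?")
        else:
            labels.append("ref" if ref > alt else "alt")
    return labels


# Illustrative depths (made up for this sketch) in a segment with 2:1 copies.
sites = [(60, 30), (28, 62), (45, 44), (0, 0)]
labels = phase_by_depth(sites)
fraction = expected_major_fraction(purity=0.5, major_cn=2, minor_cn=1)
```

Alleles that share a label within the same segment are inferred to lie on the same haplotype, which is what allows phasing to extend beyond the span of a single short read. A per-site threshold like this ignores sequencing noise and segment boundaries, which is the kind of gap a joint optimization over purity, depth, and copy number is meant to close.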
In Chapter 4, we address a critical need in the visualization of complex biological data with OmicsSankey. Modern omics datasets often require the depiction of intricate flows between entities, yet standard Sankey diagram layouts can become cluttered by excessive edge crossings. To resolve this, OmicsSankey reformulates the barycentric layout problem using global spectral techniques combined with a teleportation mechanism that ensures robust connectivity in ill-defined graphs. An additional micro-dissection step refines vertex placements within predefined blocks, further reducing edge crossings and enhancing the clarity of the final visualization. Benchmarking on synthetic and real datasets highlights that OmicsSankey not only improves visual interpretability but also achieves substantial computational efficiency compared to existing approaches.
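For readers unfamiliar with the underlying layout problem, the sketch below shows the classic barycenter heuristic on a two-layer graph, together with a brute-force crossing counter. It is a textbook baseline, not OmicsSankey's spectral reformulation, teleportation mechanism, or micro-dissection step; the function names and the toy graph are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (vertex in left layer, vertex in right layer)


def count_crossings(left: List[str], right: List[str], edges: List[Edge]) -> int:
    """Count pairs of edges that cross, given the vertex order in each layer."""
    lpos = {v: i for i, v in enumerate(left)}
    rpos = {v: i for i, v in enumerate(right)}
    crossings = 0
    for (a, b), (c, d) in combinations(edges, 2):
        # Two edges cross when their endpoints appear in opposite orders.
        if (lpos[a] - lpos[c]) * (rpos[b] - rpos[d]) < 0:
            crossings += 1
    return crossings


def barycenter_order(left: List[str], right: List[str], edges: List[Edge]) -> List[str]:
    """Reorder the right layer by the mean position of each vertex's neighbours."""
    lpos = {v: i for i, v in enumerate(left)}
    neighbours: Dict[str, List[int]] = {v: [] for v in right}
    for a, b in edges:
        neighbours[b].append(lpos[a])

    def key(v: str) -> float:
        ns = neighbours[v]
        # Vertices without neighbours keep their current position as the key.
        return sum(ns) / len(ns) if ns else float(right.index(v))

    return sorted(right, key=key)


# Toy graph (hypothetical): every edge crosses every other edge initially.
left = ["A", "B", "C"]
right = ["x", "y", "z"]
edges = [("A", "z"), ("B", "y"), ("C", "x")]

before = count_crossings(left, right, edges)
new_right = barycenter_order(left, right, edges)
after = count_crossings(left, new_right, edges)
```

In this toy case a single sweep removes all three crossings. On larger multi-layer diagrams the heuristic is applied layer by layer and can stall in a poor local arrangement, which is the limitation that a global formulation aims to address.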
Collectively, the contributions presented in this thesis deliver a robust suite of computational tools that improve the analysis and interpretation of high-dimensional omics data. By advancing methodologies for adaptive clustering, accurate haplotyping in complex tumor genomes, and optimized data visualization, this work provides potential insights into cellular differentiation, tumor evolution, and systems-level biological interactions. The frameworks developed here also pave the way for future research in personalized medicine and integrative biological studies.
| Date of Award | 5 Sept 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Shuaicheng LI (Supervisor) |