Abstract
Co-clustering, which simultaneously groups both rows and columns of data matrices, has emerged as a powerful technique for discovering latent patterns in high-dimensional datasets by exploiting the duality between data dimensions. While traditional clustering methods treat rows and columns independently, co-clustering reveals hidden block structures that capture meaningful relationships between entities and features simultaneously. Despite theoretical advances and successful applications in domains such as text mining and bioinformatics, co-clustering faces fundamental challenges in scalability, adaptability, and extension to complex multi-dimensional data structures, limiting its applicability to modern large-scale and multi-modal datasets.
This thesis presents a comprehensive framework that advances co-clustering theory and applications across three interconnected domains: scalable distributed computing, geometric pattern detection, and biological systems analysis. Through rigorous theoretical development, principled algorithmic innovation, and extensive experimental validation, we establish co-clustering as a versatile framework capable of addressing fundamental challenges in large-scale data analysis while providing theoretical guarantees for performance and reliability.
Our first contribution, DiMergeCo, is the first distributed co-clustering algorithm with probabilistic guarantees for pattern preservation. It features a novel matrix partitioning method and a hierarchical merging strategy that reduces communication complexity to O(log n). DiMergeCo is designed to handle large-scale data efficiently, demonstrating an 83% reduction in computation time on dense matrices and 30% on sparse matrices, while successfully processing million-dimensional datasets.
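To illustrate why a hierarchical merging strategy can keep the number of communication rounds logarithmic, the following is a minimal sketch of generic pairwise tree merging. It is not DiMergeCo's merging procedure: the function `hierarchical_merge`, the set-union `merge` callback, and the modeling of each worker's partial result as a set of row indices are all assumptions made for illustration, and n is assumed here to mean the number of partitions being merged.

```python
import math


def hierarchical_merge(partials, merge):
    """Merge partial results pairwise, level by level.

    Returns (merged_result, rounds). With k partial results the loop runs
    ceil(log2(k)) rounds, because each round roughly halves the list.
    """
    if not partials:
        raise ValueError("need at least one partial result")
    level = list(partials)
    rounds = 0
    while len(level) > 1:
        # Merge neighbours (0,1), (2,3), ...; an unpaired last item is carried over.
        merged_level = [
            merge(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            merged_level.append(level[-1])
        level = merged_level
        rounds += 1
    return level[0], rounds


# Hypothetical example: 8 workers, each holding one row index of a co-cluster.
parts = [{i} for i in range(8)]
merged, rounds = hierarchical_merge(parts, lambda a, b: a | b)
```

Under these assumptions, eight partial results are combined in three rounds, compared with seven sequential merges at a single coordinator.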
Our second contribution is an information-theoretic framework for ellipse detection, which pioneers the use of Jensen-Shannon divergence for geometric pattern matching. This framework combines probabilistic arc modeling, adaptive clustering via SVD, and a contrario statistical validation for robustness in noisy and occluded environments. It achieves state-of-the-art performance in challenging real-world measurement systems.
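For readers unfamiliar with the measure, the following is a minimal sketch of the Jensen-Shannon divergence between two discrete distributions, using the standard definition with base-2 logarithms. How the thesis builds distributions from arcs is not stated in this abstract; the function name `js_divergence` and the two example histograms are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two non-negative histograms.

    Inputs are normalized to sum to 1. The result lies in [0, 1]:
    0 for identical distributions, 1 for distributions with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    # Mixture distribution M = (P + Q) / 2
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; bins with a == 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical histograms, e.g. binned features of two candidate arcs.
arc_a = [4, 3, 2, 1]
arc_b = [1, 2, 3, 4]
score = js_divergence(arc_a, arc_b)
```

Unlike the Kullback-Leibler divergence, this measure is symmetric and bounded, which makes it convenient as a similarity score between candidate patterns.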
Our third contribution, DiMergeTCC, extends co-clustering to three-mode tensors (e.g., cells × time × features), preserving higher-order biological relationships with probabilistic guarantees. This marks the first distributed tensor co-clustering approach of its kind. Applied to Caenorhabditis elegans morphogenesis analysis, it improves time alignment by 22–34% and recovers biological processes with high statistical significance.
The comprehensive evaluation across synthetic and real-world datasets validates the effectiveness of our integrated approach. The distributed co-clustering framework enables analysis of datasets previously beyond computational reach while maintaining theoretical guarantees. The information-theoretic geometric detection framework demonstrates superior robustness in challenging measurement scenarios including industrial inspection and medical imaging. The tensor co-clustering framework establishes new paradigms for hypothesis-free biological discovery, providing interpretable insights into complex developmental processes.
This research establishes co-clustering as a unified framework for analyzing complex, multi-dimensional datasets across diverse domains. By combining rigorous theoretical analysis with principled algorithmic development, we demonstrate that co-clustering principles can be successfully extended to address fundamental challenges in distributed computing, computer vision, and computational biology. The integrated approach provides a template for future research that bridges theoretical computer science with practical applications, potentially leading to new paradigms in distributed machine learning, geometric pattern recognition, and biological systems analysis.
| Date of Award | 11 Nov 2025 |
|---|---|
| Original language | English |
| Awarding Institution | |
| Supervisor | Hong YAN (Supervisor) |