
Development of Scalable and Adaptive Co-Clustering Algorithms and Their Applications

Student thesis: Doctoral Thesis

Abstract

Co-clustering, which groups the rows and columns of a data matrix simultaneously, has emerged as a powerful technique for discovering latent patterns in high-dimensional datasets by exploiting the duality between data dimensions. While traditional clustering methods treat rows and columns independently, co-clustering reveals hidden block structures that capture meaningful relationships between entities and features. Despite theoretical advances and successful applications in domains such as text mining and bioinformatics, co-clustering faces fundamental challenges in scalability, adaptability, and extension to complex multi-dimensional data structures, limiting its applicability to modern large-scale and multi-modal datasets.

This thesis presents a comprehensive framework that advances co-clustering theory and applications across three interconnected domains: scalable distributed computing, geometric pattern detection, and biological systems analysis. Through rigorous theoretical development, principled algorithmic innovation, and extensive experimental validation, we establish co-clustering as a versatile framework capable of addressing fundamental challenges in large-scale data analysis while providing theoretical guarantees for performance and reliability.

Our first contribution, DiMergeCo, is the first distributed co-clustering algorithm with probabilistic guarantees for pattern preservation. It features a novel matrix partitioning method and a hierarchical merging strategy that reduces communication complexity to O(log n). DiMergeCo is designed to handle large-scale data efficiently, demonstrating an 83% reduction in computation time on dense matrices and 30% on sparse matrices, while successfully processing million-dimensional datasets.
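The abstract does not describe how the hierarchical merging is carried out. As a rough illustration only of why a tree-shaped merge needs a logarithmic number of rounds, the sketch below merges partial results pairwise until one remains. The names `hierarchical_merge` and `merge`, and the use of set union as the merge step, are assumptions made for this example and are not taken from DiMergeCo.

```python
def hierarchical_merge(parts, merge):
    """Merge partial results pairwise, level by level (hypothetical sketch).

    Returns the merged result and the number of merge rounds used.
    With k parts, the number of rounds is ceil(log2(k)).
    """
    parts = list(parts)
    rounds = 0
    while len(parts) > 1:
        next_level = []
        for i in range(0, len(parts), 2):
            if i + 1 < len(parts):
                # Two neighbours are combined in this round.
                next_level.append(merge(parts[i], parts[i + 1]))
            else:
                # An unpaired part is carried over unchanged.
                next_level.append(parts[i])
        parts = next_level
        rounds += 1
    return parts[0], rounds


# Assumed toy merge step: union of row-index sets found in each partition.
local_results = [{0, 1}, {1, 2}, {3}, {4, 5}, {5, 6}, {7}, {8}, {9}]
merged, rounds = hierarchical_merge(local_results, lambda a, b: a | b)
print(merged, rounds)
```

With eight partitions the loop runs three rounds, whereas folding the partitions in one after another would take seven sequential merges.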

Our second contribution is an information-theoretic framework for ellipse detection, which pioneers the use of Jensen-Shannon divergence for geometric pattern matching. This framework combines probabilistic arc modeling, adaptive clustering via SVD, and a contrario statistical validation for robustness in noisy and occluded environments. It achieves state-of-the-art performance in challenging real-world measurement systems.
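For reference, the Jensen-Shannon divergence between two discrete distributions P and Q is the average of the Kullback-Leibler divergences of each from their mixture M = (P + Q) / 2. The sketch below computes it in base 2, so the value lies between 0 and 1. How the thesis builds distributions from arcs is not stated in the abstract; the histograms here are made-up inputs for illustration only.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    # The mixture is strictly positive wherever p or q is, so KL is always finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical normalized histograms (each sums to 1).
hist_a = [0.5, 0.3, 0.2, 0.0]
hist_b = [0.4, 0.3, 0.2, 0.1]
print(js_divergence(hist_a, hist_b))
```

Unlike the Kullback-Leibler divergence, this quantity is symmetric and stays finite when one distribution has zeros where the other does not, which makes it convenient as a similarity score between empirical histograms.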

Our third contribution, DiMergeTCC, extends co-clustering to three-mode tensors (e.g., cells × time × features), preserving higher-order biological relationships with probabilistic guarantees. This marks the first distributed tensor co-clustering approach of its kind. Applied to Caenorhabditis elegans morphogenesis analysis, it improves time alignment by 22–34% and recovers biological processes with high statistical significance.

The comprehensive evaluation across synthetic and real-world datasets validates the effectiveness of our integrated approach. The distributed co-clustering framework enables analysis of datasets previously beyond computational reach while maintaining theoretical guarantees. The information-theoretic geometric detection framework demonstrates superior robustness in challenging measurement scenarios including industrial inspection and medical imaging. The tensor co-clustering framework establishes new paradigms for hypothesis-free biological discovery, providing interpretable insights into complex developmental processes.

This research establishes co-clustering as a unified framework for analyzing complex, multi-dimensional datasets across diverse domains. By combining rigorous theoretical analysis with principled algorithmic development, we demonstrate that co-clustering principles can be successfully extended to address fundamental challenges in distributed computing, computer vision, and computational biology. The integrated approach provides a template for future research that bridges theoretical computer science with practical applications, potentially leading to new paradigms in distributed machine learning, geometric pattern recognition, and biological systems analysis.
Date of Award: 11 Nov 2025
Original language: English
Awarding Institution
  • City University of Hong Kong
Supervisor: Hong YAN
