
GMHCC: high-throughput analysis of biomolecular data using graph-based multiple hierarchical consensus clustering

Yifu Lu, Zhuohan Yu, Yunhe Wang, Zhiqiang Ma, Ka-Chun Wong, Xiangtao Li*

*Corresponding author for this work

Research output: Journal Publications and Reviews, RGC 21 - Publication in refereed journal, peer-review

Abstract

Motivation: Thanks to the development of high-throughput sequencing technologies, massive amounts of various biomolecular data have been accumulated to revolutionize the study of genomics and molecular biology. One of the main challenges in analyzing this biomolecular data is to cluster their subtypes into subpopulations to facilitate subsequent downstream analysis. Recently, many clustering methods have been developed to address the biomolecular data. However, the computational methods often suffer from many limitations such as high dimensionality, data heterogeneity and noise.

Results: In our study, we develop a novel Graph-based Multiple Hierarchical Consensus Clustering (GMHCC) method with an unsupervised graph-based feature ranking (FR) and a graph-based linking method to explore the multiple hierarchical information of the underlying partitions of the consensus clustering for multiple types of biomolecular data. Indeed, we first propose to use a graph-based unsupervised FR model to measure each feature by building a graph over pairwise features and then providing each feature with a rank. Subsequently, to maintain the diversity and robustness of basic partitions (BPs), we propose multiple diverse feature subsets to generate several BPs and then explore the hierarchical structures of the multiple BPs by refining the global consensus function. Finally, we develop a new graph-based linking method, which explicitly considers the relationships between clusters to generate the final partition. Experiments on multiple types of biomolecular data including 35 cancer gene expression datasets and eight single-cell RNA-seq datasets validate the effectiveness of our method over several state-of-the-art consensus clustering approaches. Furthermore, differential gene analysis, gene ontology enrichment analysis and KEGG pathway analysis are conducted, providing novel insights into cell developmental lineages and characterization mechanisms.
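The pipeline outlined in the abstract (rank features, build basic partitions from several feature subsets, merge them by consensus) can be illustrated with a generic consensus-clustering sketch. This is not the authors' GMHCC implementation. Variance ranking stands in for the graph-based FR model, plain k-means is assumed as the base clusterer, and connected components of a thresholded co-association graph stand in for the hierarchical refinement and graph-based linking steps. All function names, parameters and the toy data are hypothetical.

```python
from itertools import combinations


def variance_rank(data):
    """Stand-in feature ranking (an assumption): feature indices by variance, highest first."""
    def variance(col):
        mean = sum(col) / len(col)
        return sum((x - mean) ** 2 for x in col) / len(col)
    columns = list(zip(*data))
    return sorted(range(len(columns)), key=lambda f: variance(columns[f]), reverse=True)


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def consensus_cluster(data, k, top, subset_size, threshold=0.5):
    """Cluster on every subset of the top-ranked features, then merge by co-association."""
    ranked = variance_rank(data)[:top]
    partitions = []
    for feats in combinations(ranked, subset_size):
        points = [tuple(row[f] for f in feats) for row in data]
        partitions.append(kmeans(points, k))  # one basic partition per feature subset

    n = len(data)
    # co[i][j]: fraction of basic partitions that put samples i and j together
    co = [[sum(p[i] == p[j] for p in partitions) / len(partitions) for j in range(n)]
          for i in range(n)]

    # Final partition: connected components of the graph with an edge where co >= threshold
    labels, current = [-1] * n, 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start], stack = current, [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy data: two well-separated groups on features 0-2; feature 3 is low-variance noise.
data = [
    (0.0, 0.1, 0.2, 0.5),
    (0.2, 0.0, 0.1, 0.4),
    (0.1, 0.2, 0.0, 0.6),
    (10.0, 10.1, 9.9, 0.5),
    (10.2, 9.8, 10.0, 0.4),
    (9.9, 10.0, 10.1, 0.6),
]
labels = consensus_cluster(data, k=2, top=3, subset_size=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

On the toy data the noise feature is ranked last and excluded, so each of the three feature subsets yields the same basic partition and the consensus step recovers the two groups. The co-association matrix is only one common choice of consensus function; the paper's own consensus function and linking method may differ substantially.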
Original language: English
Pages (from-to): 3020-3028
Journal: Bioinformatics
Volume: 38
Issue number: 11
Online published: 22 Apr 2022
Publication status: Published - 1 Jun 2022

Funding

The work described in this article was substantially supported by the National Natural Science Foundation of China [62076109], and also funded by ‘the Fundamental Research Funds for the Central Universities’. The work described in this article was substantially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region [CityU 11200218], a grant from the Health and Medical Research Fund, of the Food and Health Bureau, The Government of the Hong Kong Special Administrative Region [07181426], and the funding from the Hong Kong Institute for Data Science (HKIDS) at the City University of Hong Kong. The work described in this paper was partially supported by two grants from the City University of Hong Kong (CityU 11202219, CityU 11203520).

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

Research Keywords

  • RNA-SEQ DATA
  • EXPRESSION

RGC Funding Information

  • RGC-funded
