On triangle inequalities of correlation-based distances for gene expression profiles
Research output: Journal Publications and Reviews (RGC: 21, 22, 62) › 21_Publication in refereed journal › peer-review
Detail(s)
Original language | English |
---|---|
Article number | 40 |
Journal / Publication | BMC Bioinformatics |
Volume | 24 |
Online published | 8 Feb 2023 |
Publication status | Published - 2023 |
Link(s)
DOI | DOI |
---|---|
Link to Scopus | https://www.scopus.com/record/display.uri?eid=2-s2.0-85147722871&origin=recordpage |
Permanent Link | https://scholars.cityu.edu.hk/en/publications/publication(c127db08-9de9-4fa1-beff-5284874729f3).html |
Abstract
Background: Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function would output a low value if the profiles are strongly correlated, whether negatively or positively, and a high value otherwise. One popular distance function is the absolute correlation distance, d_a = 1 - |ρ|, where ρ is a similarity measure such as Pearson or Spearman correlation. However, the absolute correlation distance does not satisfy the triangle inequality, a property that would guarantee better performance in vector quantization, allow fast data localization, and accelerate data clustering.
Results: In this work, we propose d_r = √(1 - |ρ|) as an alternative. We prove that d_r satisfies the triangle inequality when ρ represents Pearson correlation, Spearman correlation, or cosine similarity. We show, both analytically and experimentally, that d_r is better than d_s = √(1 - ρ²), another variant of d_a that satisfies the triangle inequality. We empirically compared d_r with d_a in gene clustering and sample clustering experiments on real-world biological data. The two distances performed similarly in both gene clustering and sample clustering under hierarchical clustering and PAM (partitioning around medoids) clustering. However, d_r demonstrated more robust clustering. In the bootstrap experiment, d_r generated more robust sample pair partitions more frequently (P-value < 0.05). The statistics on the time at which a class "dissolved" also support the advantage of d_r in robustness.
Conclusion: d_r, as a variant of the absolute correlation distance, satisfies the triangle inequality and is capable of more robust clustering. © 2023, The Author(s).
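The difference between the three distances named in the abstract can be illustrated with a small numerical sketch. The code below is not taken from the paper: the function names, the use of NumPy's `np.corrcoef` for Pearson correlation, and the three example vectors are all assumptions chosen for illustration. It checks the triangle inequality on one hand-picked triple where two profiles are uncorrelated and the third is their sum.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation of two 1-D profiles (illustrative helper)."""
    return float(np.corrcoef(x, y)[0, 1])


def d_a(x, y):
    """Absolute correlation distance: 1 - |rho|."""
    return 1.0 - abs(pearson(x, y))


def d_r(x, y):
    """Square-root variant: sqrt(1 - |rho|)."""
    # max(0, .) guards against tiny negative values from rounding.
    return float(np.sqrt(max(0.0, 1.0 - abs(pearson(x, y)))))


def d_s(x, y):
    """Squared-correlation variant: sqrt(1 - rho^2)."""
    return float(np.sqrt(max(0.0, 1.0 - pearson(x, y) ** 2)))


def satisfies_triangle(dist, x, y, z, tol=1e-12):
    """Check all three triangle inequalities for one triple of profiles."""
    a, b, c = dist(x, y), dist(y, z), dist(x, z)
    return a <= b + c + tol and b <= a + c + tol and c <= a + b + tol


# Hypothetical example profiles (not real expression data):
# x and z are uncorrelated; y = x + z correlates with each at about 0.707.
x = np.array([1.0, -1.0, 0.0, 0.0])
z = np.array([0.0, 0.0, 1.0, -1.0])
y = x + z

triangle_ok = {
    "d_a": satisfies_triangle(d_a, x, y, z),
    "d_r": satisfies_triangle(d_r, x, y, z),
    "d_s": satisfies_triangle(d_s, x, y, z),
}

for name, ok in triangle_ok.items():
    print(f"{name}: triangle inequality holds on this triple = {ok}")
```

On this triple, d_a(x, z) is 1 while d_a(x, y) + d_a(y, z) is roughly 0.59, so d_a violates the inequality, whereas d_r and d_s do not. A single triple only demonstrates the violation for d_a; it is not evidence of the general result for d_r, which the paper establishes by proof.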
Research Area(s)
- Clustering, Correlation, Distance, Gene expression analysis, Single cell, Triangle inequality
Citation Format(s)
On triangle inequalities of correlation-based distances for gene expression profiles. / Chen, Jiaxing; Ng, Yen Kaow; Lin, Lu et al.
In: BMC Bioinformatics, Vol. 24, 40, 2023.