High-Throughput Biomolecular Data Representation and Clustering


Student thesis: Doctoral Thesis

Award date: 24 Aug 2020


In recent years, advances in single-cell RNA-seq techniques have enabled large-scale, high-throughput transcriptomic profiling at single-cell resolution. Transcriptomic profiling measures gene expression levels under different experimental conditions and at different time points. As these technologies have developed, the dimensionality of gene expression data has grown to hundreds of thousands or more, offering higher-resolution insights. However, transcriptomic profiles are high-dimensional and sparse by nature, so representing and clustering high-throughput biomolecular data remains a long-standing challenge.

In this thesis, we first review existing single-cell RNA-seq data clustering methods, with critical insights into their advantages and limitations. We also review the upstream single-cell RNA-seq data processing techniques (representations), such as quality control, normalization, and dimension reduction. We then conduct performance comparison experiments that evaluate several popular single-cell RNA-seq clustering approaches on two single-cell transcriptomic datasets. Finally, we present two topics that exploit the relations between transcriptomic profiles and random composite measurements and uncover the key dimensions of high-throughput biomolecular data.
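As a rough illustration of the kind of upstream processing and clustering reviewed here, the following sketch chains library-size normalization, a log transform, PCA-based dimension reduction, and k-means clustering on synthetic count data. It is not one of the methods evaluated in the thesis: the function names, the synthetic Poisson data, the scale factor of 10,000, and the farthest-point initialization are all assumptions made for the example.

```python
import numpy as np

def normalize_log(counts, scale=1e4):
    """Library-size normalize each cell (row), then apply log1p."""
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * scale)

def pca(x, n_components):
    """Project mean-centered data onto its top principal components via SVD."""
    centered = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def farthest_point_init(x, k):
    """Deterministic initialization: start at x[0], then repeatedly add the
    point farthest from all centers chosen so far (an assumption of this sketch)."""
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    return np.array(centers, dtype=float)

def kmeans(x, k, n_iter=20):
    """Plain Lloyd iterations: assign to nearest center, then update centers."""
    centers = farthest_point_init(x, k)
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels

# Synthetic data (hypothetical): two cell groups with different marker genes.
rng = np.random.default_rng(0)
n_genes = 50
rates_a = np.full(n_genes, 10.0)
rates_a[:10] = 100.0          # markers of group A
rates_b = np.full(n_genes, 10.0)
rates_b[10:20] = 100.0        # markers of group B
counts = np.vstack([
    rng.poisson(rates_a, size=(30, n_genes)),
    rng.poisson(rates_b, size=(30, n_genes)),
]).astype(float)

embedding = pca(normalize_log(counts), n_components=2)
labels = kmeans(embedding, k=2)
print(labels)
```

Real single-cell data are far sparser and noisier than this toy input, which is one reason the choice of representation matters as much as the clustering algorithm itself.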

In the first topic, we propose a mathematical framework, termed DECS, that combines differential evolution (global search) with compressed sensing (local search). Exploiting the inherent sparsity of gene expression data, DECS learns sparse module dictionaries and module levels from low-dimensional random composite measurements and uses them to reconstruct the high-dimensional gene expression data, with the measurements reduced by a large factor (e.g., 200x). Several experiments compare DECS with three benchmark methods and show that DECS outperforms them and recovers most of the gene expression patterns. The underlying reasons are discussed, with mechanistic insights drawn from extensive benchmarks on nine GSE datasets and the associated sensitivity analysis.
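To make the idea of random composite measurements concrete, here is a minimal sketch under strong simplifying assumptions. It is not DECS: the module dictionary `U` is assumed to be known rather than learned, and orthogonal matching pursuit (OMP) stands in for the differential-evolution search. The matrix sizes, the Gaussian measurement weights, and all variable names are hypothetical.

```python
import numpy as np

def omp(d, y, sparsity):
    """Orthogonal matching pursuit: greedy sparse solution of y ~ d @ w."""
    norms = np.linalg.norm(d, axis=0)
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(sparsity):
        scores = np.abs(d.T @ residual) / norms
        if support:
            scores[support] = 0.0  # never pick the same column twice
        support.append(int(scores.argmax()))
        # Refit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(d[:, support], y, rcond=None)
        residual = y - d[:, support] @ coef
    w = np.zeros(d.shape[1])
    w[support] = coef
    return w

rng = np.random.default_rng(0)
n_genes, n_modules, genes_per_module = 3000, 200, 15
n_measurements, n_samples, sparsity = 150, 4, 2

# Module dictionary U (assumed known here): each module covers its own genes.
U = np.zeros((n_genes, n_modules))
for j in range(n_modules):
    rows = slice(j * genes_per_module, (j + 1) * genes_per_module)
    U[rows, j] = rng.uniform(0.5, 1.5, size=genes_per_module)

# Sparse module levels W: only a few modules are active in each sample.
W = np.zeros((n_modules, n_samples))
for s in range(n_samples):
    active = rng.choice(n_modules, size=sparsity, replace=False)
    W[active, s] = rng.uniform(1.0, 2.0, size=sparsity)

X = U @ W                                      # high-dimensional expression
A = rng.standard_normal((n_measurements, n_genes))
Y = A @ X                                      # 20x fewer composite measurements

# Recover the module levels from Y, then rebuild the expression profiles.
D = A @ U
W_hat = np.column_stack([omp(D, Y[:, s], sparsity) for s in range(n_samples)])
X_hat = U @ W_hat
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_error:.2e}")
```

Each row of `A` defines one composite measurement, that is, one weighted sum over all genes. Recovery is possible with fewer measurements than modules only because few modules are active per sample; learning `U` itself from the measurements is the harder problem that the thesis addresses.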

In the second topic, we propose a deep learning framework based on an auto-encoder, termed DeepAE, to elucidate high-dimensional transcriptomic profiling data in an encode-decode manner. Comparative experiments on nine transcriptomic profiling datasets compare DeepAE with four benchmark methods. The results demonstrate that DeepAE outperforms the benchmark methods, with robust performance in uncovering the key dimensions of single-cell RNA-seq data. We also comprehensively investigate the reconstruction performance of DeepAE in other contexts and on other platforms, such as mass cytometry and metabolic profiling. Gene ontology enrichment and pathology analysis are conducted to reveal the mechanisms behind the robust performance of DeepAE by uncovering its key dimensions.
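The encode-decode idea can be sketched with a very small auto-encoder. The example below is not the DeepAE architecture: it uses a single tanh encoding layer, a linear decoder, full-batch gradient descent, and synthetic data generated from three latent factors. The layer sizes, learning rate, and function names are assumptions chosen only to keep the example short.

```python
import numpy as np

def init_params(n_in, n_code, rng, scale=0.1):
    """Small random weights and zero biases for encoder and decoder."""
    return {
        "w_enc": rng.standard_normal((n_in, n_code)) * scale,
        "b_enc": np.zeros(n_code),
        "w_dec": rng.standard_normal((n_code, n_in)) * scale,
        "b_dec": np.zeros(n_in),
    }

def encode(p, x):
    """Map each profile to a low-dimensional code."""
    return np.tanh(x @ p["w_enc"] + p["b_enc"])

def decode(p, h):
    """Map codes back to the original feature space."""
    return h @ p["w_dec"] + p["b_dec"]

def train_step(p, x, lr):
    """One full-batch gradient step on mean squared reconstruction error.
    Returns the loss measured before the update."""
    h = encode(p, x)
    err = decode(p, h) - x
    loss = float(np.mean(err ** 2))
    g_out = 2.0 * err / err.size                      # dLoss / d(reconstruction)
    g_pre = (g_out @ p["w_dec"].T) * (1.0 - h ** 2)   # back through tanh
    grads = {
        "w_dec": h.T @ g_out,
        "b_dec": g_out.sum(axis=0),
        "w_enc": x.T @ g_pre,
        "b_enc": g_pre.sum(axis=0),
    }
    for name in p:
        p[name] -= lr * grads[name]
    return loss

# Synthetic profiles (hypothetical): 100 samples, 20 features, 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.standard_normal((100, 3))
mixing = rng.standard_normal((3, 20))
x = latent @ mixing
x = (x - x.mean(axis=0)) / x.std(axis=0)   # standardize each feature

params = init_params(n_in=20, n_code=3, rng=rng)
losses = [train_step(params, x, lr=0.02) for _ in range(3000)]
code = encode(params, x)                   # the learned "key dimensions"
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The low-dimensional `code` plays the role of the key dimensions mentioned above; a deeper network would stack several such layers on each side of the bottleneck.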