Algorithm and Platform for Cancer Single Cell DNA


Student thesis: Doctoral Thesis



Award date: 23 Sept 2020


Many studies have observed that intra-tumor heterogeneity is one of the principal causes of cancer therapy resistance, tumor recurrence, and death. An accurate understanding of subclone structure and evolutionary history contributes to precise treatment for individual patients. Evolutionary processes shape the heterogeneity of tumors. This heterogeneity can be recognized at the DNA level in diverse forms, including single nucleotide variations, microsatellite instability, copy number variations, structural variations, complex rearrangements, and chromosomal instability.

With the explosion of single-cell DNA sequencing in cancer analysis, it is becoming increasingly crucial to call somatic variants accurately at single-cell resolution. Compared with traditional bulk sequencing, single-cell sequencing offers several advantages. Although bulk studies have deepened our understanding of tumor biology, bulk sequencing can only report the mixed signal of tumor cells or clones with diverse genotypes, which masks intra-tumor heterogeneity. For instance, if the averaged read-out overrepresents the genomic data of the dominant cluster of tumor cells, rare subclones are hidden in the signal. Many computational deconvolution tools have been implemented, but they can provide only a rough estimate of the mixture of cell types or subclones. Single-cell genomic sequencing addresses this concern directly. Researchers can overcome the limitations of bulk profiling to characterize intra-tumor heterogeneity, distinguish tumor subclones, reconstruct clonal lineage and metastasis, and infer the causes of therapeutic resistance.
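The masking effect described above can be illustrated with a toy calculation. This is a minimal sketch under invented assumptions: the cell counts (95 and 5), the copy numbers (2 and 6), and the function names are hypothetical and do not come from the thesis data.

```python
from collections import Counter
from statistics import mean


def bulk_signal(copy_numbers):
    """Bulk-like read-out: one averaged value over all cells at a locus."""
    return mean(copy_numbers)


def single_cell_signal(copy_numbers):
    """Single-cell read-out: how many cells carry each copy number."""
    return Counter(copy_numbers)


# Hypothetical locus: 95 cells of a dominant clone (copy number 2)
# and 5 cells of a rare subclone with an amplification (copy number 6).
cells = [2] * 95 + [6] * 5

print("bulk average:", bulk_signal(cells))
print("per-cell counts:", dict(single_cell_signal(cells)))
```

In this toy case the averaged value is 2.2, which is close to the dominant clone's copy number, while the per-cell counts still show the five amplified cells as a separate group.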

During my Ph.D. study, I dedicated myself to intra-tumor heterogeneity analysis of single-cell DNA sequencing data in three aspects: (1) applying dimension reduction techniques to decipher the tumor clonal substructure; (2) building the clonal lineage within a tumor; (3) resolving complex structural variations across subclones.

First, we focus on resolving the hidden structure of single cells in scDNA copy number profiles. With recent advances in high-throughput technologies, matrix factorization (MF) techniques are increasingly used to map quantitative omics profiling matrices into a low-dimensional embedding space, in the hope of uncovering insights into the underlying biological processes. Nevertheless, current matrix factorization tools fall short in handling noisy data and missing entries, both of which are common in real-life data. We propose a novel deep neural network-based matrix factorization framework, DeepMF, which separately maps molecular features and single cells into a low-dimensional latent space while tolerating noisy and missing entries. We demonstrate with in silico instances that DeepMF is robust in denoising, imputation, and embedding. We then collected four wet lab datasets (medulloblastoma, leukemia, breast cancer, and small-blue-round-cell cancer) as benchmark sets to evaluate the tools. DeepMF outperformed the existing MF tools on cancer subtype discovery in the omics profiles of the four benchmark datasets, with the highest clustering accuracy on all four. Furthermore, with 70% of the data randomly removed, DeepMF demonstrated the best recovery capacity, with silhouette values of 0.47, 0.6, 0.28, and 0.44. It also displayed the best embedding power on the four sparse benchmark sets, with clustering accuracies of 88%, 100%, 84%, and 96%, respectively, matching or improving on the 76%, 100%, 78%, and 87% of the current best methods. With respect to single-cell data, we simulated one scDNA CNV profile with random noise as a benchmark set. DeepMF displayed the best denoising and embedding power among existing tools.
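To make the idea of factorizing a matrix with missing entries concrete, the following is a minimal sketch of plain low-rank matrix factorization trained only on observed entries. It is not DeepMF and contains no neural network. The function names, the toy matrix, and the hyperparameters (`rank`, `epochs`, `lr`) are all assumptions made for illustration.

```python
import random


def masked_mf(matrix, rank=1, epochs=2000, lr=0.02, seed=0):
    """Factorize `matrix` (rows = cells, columns = features) as row_f x col_f^T.

    Missing entries are given as None and are skipped during training,
    so the factors are fitted to the observed entries only.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_f = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_rows)]
    col_f = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_cols)]
    observed = [(i, j) for i in range(n_rows) for j in range(n_cols)
                if matrix[i][j] is not None]
    for _ in range(epochs):
        for i, j in observed:
            pred = sum(a * b for a, b in zip(row_f[i], col_f[j]))
            err = matrix[i][j] - pred
            for k in range(rank):
                r, c = row_f[i][k], col_f[j][k]
                # Stochastic gradient step on the squared error of one entry.
                row_f[i][k] += lr * err * c
                col_f[j][k] += lr * err * r
    return row_f, col_f


def reconstruct(row_f, col_f):
    """Rebuild the full matrix, including imputed values for missing entries."""
    return [[sum(a * b for a, b in zip(rf, cf)) for cf in col_f] for rf in row_f]


# Hypothetical 4-cell x 3-bin profile with one missing entry (true value 4.0).
profile = [
    [1.0, 0.5, 2.0],
    [2.0, 1.0, None],
    [1.0, 0.5, 2.0],
    [2.0, 1.0, 4.0],
]
row_factors, col_factors = masked_mf(profile, rank=1)
imputed = reconstruct(row_factors, col_factors)[1][2]
print("imputed value for the missing entry:", round(imputed, 2))
```

The rows of `row_factors` act as low-dimensional embeddings of the cells and could be passed to any clustering method. The toy matrix is exactly rank 1, so the missing entry is recoverable from the observed ones. Real profiles are noisy, which is the setting the abstract says DeepMF is designed to tolerate.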

Second, evolutionary processes within a tumor can be represented as a phylogenetic tree. Variations identified in all cancer cells are regarded as the tree’s backbone, and subclonal mutations that exist in only a portion of the cancer cells constitute the branches. With the aid of subclonal prevalence information, computational tools can infer the subclonal hierarchy and temporal distance of the tumor phylogeny. Nevertheless, computational challenges arise when genomic profiles reach single-cell resolution. Existing tools show their limitations when dealing with today’s high-throughput technologies, which yield thousands of single cells at a time. We develop a tool that effectively constructs the life history of single cells. We then use this tool to investigate three lung cancer cases and discuss the potential tumor evolution processes underlying the sequencing reads.
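The backbone-and-branch idea can be sketched on a toy mutation table. The example assumes noise-free calls and that each mutation arises once and is never lost, so the sets of cells carrying each mutation are nested. These assumptions, the function names, and the cell and mutation labels are hypothetical and are not the thesis's lineage tool.

```python
def split_backbone_and_branches(cell_mutations):
    """Mutations found in every cell form the backbone; the rest are subclonal."""
    mutation_sets = [set(m) for m in cell_mutations.values()]
    backbone = set.intersection(*mutation_sets)
    subclonal = set.union(*mutation_sets) - backbone
    return backbone, subclonal


def mutation_parents(cell_mutations):
    """Place each mutation under the mutation with the smallest strict
    superset of carrier cells (None marks a root)."""
    carriers = {}
    for cell, mutations in cell_mutations.items():
        for mutation in mutations:
            carriers.setdefault(mutation, set()).add(cell)
    parents = {}
    for mutation, cells in carriers.items():
        supersets = [other for other, other_cells in carriers.items()
                     if cells < other_cells]
        parents[mutation] = (
            min(supersets, key=lambda other: (len(carriers[other]), other))
            if supersets else None
        )
    return parents


# Hypothetical calls: mutation A is shared by all four cells.
calls = {
    "cell1": {"A", "B"},
    "cell2": {"A", "B", "C"},
    "cell3": {"A", "D"},
    "cell4": {"A", "D", "E"},
}
backbone, subclonal = split_backbone_and_branches(calls)
print("backbone:", sorted(backbone))
print("subclonal:", sorted(subclonal))
print("parents:", mutation_parents(calls))
```

Mutations with identical carrier sets are not ordered relative to each other in this sketch. Real single-cell calls also contain dropouts and false positives, and this nesting rule does not handle either.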

Third, structural variation, which comprises deletions, duplications, translocations, and inversions, is a genomic aberration larger than 50 bp and is considered the main reason for tumor progression. Advances in next-generation sequencing and long-read sequencing technologies have suggested that, rather than occurring independently, a group of structural variations may arise together and end in a complex rearrangement, also called a complex structural variation. Several computational tools have been developed to detect complex rearrangements, including chromothripsis, chromoplexy, virus integration, L1-mediated rearrangement, breakage-fusion-bridge cycle, microhomology-mediated break-induced replication, and extrachromosomal circular DNA. However, these methods do not resolve the rearrangement haplotype for a given complex event. Furthermore, these methods are designed for bulk sequencing and cannot decipher the intra-tumor heterogeneity of structural variations at the single-cell level. We develop an analysis suite to classify adjacent structural variation events, recognize the complex rearrangement type, and reconstruct the rearrangement’s local haplotype at single-cell resolution. We illustrate the robustness of our tool on several in silico instances. We also demonstrate the heterogeneity of complex rearrangements discovered in one small lung cancer, LC003T, using scDNA-seq data.
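One simple way to picture grouping adjacent structural variation events is to cluster breakpoints that lie close together on the same chromosome. The sketch below does only that. The 10 kb `max_gap` threshold, the tuple layout, and the identifiers are assumptions for illustration, and the code does not reproduce the classification or haplotype reconstruction of the analysis suite.

```python
def cluster_breakpoints(breakpoints, max_gap=10_000):
    """Group breakpoints into clusters of neighbors on the same chromosome.

    `breakpoints` is a list of (chromosome, position, sv_id) tuples.
    A breakpoint joins the current cluster when it is on the same chromosome
    and within `max_gap` bp of the previous breakpoint in sorted order.
    """
    clusters = []
    for chrom, pos, sv_id in sorted(breakpoints):
        last = clusters[-1] if clusters else None
        if last and last[-1][0] == chrom and pos - last[-1][1] <= max_gap:
            last.append((chrom, pos, sv_id))
        else:
            clusters.append([(chrom, pos, sv_id)])
    return clusters


# Hypothetical breakpoints: sv1 and sv2 are neighbors, sv3 and sv4 are isolated.
example = [
    ("chr1", 1_000, "sv1"),
    ("chr1", 5_000, "sv2"),
    ("chr1", 500_000, "sv3"),
    ("chr2", 2_000, "sv4"),
]
for cluster in cluster_breakpoints(example):
    print([sv_id for _, _, sv_id in cluster])
```

Clusters with more than one event would be the candidates for a complex rearrangement. Deciding which type they represent requires further evidence, such as copy number and breakpoint orientation, which this sketch does not model.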

In all, cancer is a complex disease and the second leading cause of death in humans. Intra-tumor heterogeneity is one of the principal causes of cancer therapy resistance, tumor recurrence, and death. Recent advances in single-cell genomic analysis play an essential role in addressing intra-tumor heterogeneity, identifying tumor subgroups, and reconstructing tumor evolutionary trajectories at unprecedented resolution. This gives us a comprehensive understanding of cancer evolution, sheds light on curing cancers, and takes a solid step toward the precise treatment of cancer. Herein, we present the scDNA-seq Somatic Variant Analysis Suite (scSVAS), an ensemble tool for large-scale single-cell DNA somatic variant analysis. It currently supports three functionalities: copy number profile embedding, clonal lineage analysis, and complex structural variation analysis. In addition, we developed an online platform, Oviz-SingleCell, for aesthetically pleasing, real-time interactive, and user-friendly visualization. After uploading the required upstream results from scSVAS, users can make scientific discoveries, share interactive visualizations, and download high-quality, publication-ready figures.