Methods and Applications for Single Cell Sequencing Data and RNA Editing Data Analysis

針對單細胞測序數據及RNA編輯數據分析的方法和應用

Student thesis: Doctoral Thesis

Detail(s)

Award date: 23 Jun 2020

Abstract

Single-cell sequencing is becoming indispensable for studying the landscapes of cell-specific genomes and transcriptomes. The volume of single-cell sequencing data has exploded in the past several years, which calls for new computational methods and software for its analysis. Here, I address two problems: copy number variation detection from single-cell DNA sequencing (scDNA-seq) data and imputation of single-cell RNA sequencing (scRNA-seq) data.

Copy number variation (CNV) is crucial for deciphering the mechanisms of, and finding cures for, complex disorders and cancers. Recent advances in scDNA-seq technology shed light on intratumor heterogeneity, the detection of rare subclones, and the reconstruction of tumor evolution lineages at single-cell resolution. Nevertheless, the current circular binary segmentation based approach fails to identify copy number shifts efficiently and effectively in some exceptional cases. Here, we propose SCYN, a CNV segmentation method powered by dynamic programming. SCYN resolves the precise segmentation of an in silico dataset. We then verified that SCYN infers copy numbers accurately on triple-negative breast cancer scDNA-seq data, using array comparative genomic hybridization results of purified bulk samples as ground truth. We also tested SCYN on two datasets from the newly emerged 10x Genomics CNV solution. SCYN successfully recognizes gastric cancer cells in the 1% and 10% spike-in 10x datasets. Moreover, SCYN is about 150 times faster than the state-of-the-art tool on datasets of approximately 2000 cells. In summary, SCYN robustly and efficiently detects segment boundaries and infers copy number profiles from scDNA-seq data, which helps to reveal intratumor heterogeneity.
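The idea of segmenting a read-depth signal by dynamic programming can be illustrated with a minimal sketch. This is not SCYN's implementation: the squared-error cost, the fixed number of segments `k`, and the function name `segment` are assumptions chosen only to show how dynamic programming finds an optimal set of breakpoints.

```python
from itertools import accumulate


def segment(values, k):
    """Split `values` into k contiguous segments minimising total squared error.

    Returns the list of segment start indices (the first is always 0).
    Illustrative only; the cost function is an assumption, not SCYN's objective.
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums let us compute any segment's squared error in O(1).
    s1 = [0.0] + list(accumulate(values))
    s2 = [0.0] + list(accumulate(v * v for v in values))

    def cost(i, j):
        # Squared error of values[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting values[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers to recover the segment starts.
    starts, j = [], n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    return starts[::-1]


# Toy per-bin copy number signal (made-up values for illustration).
print(segment([2, 2, 2, 4, 4, 4, 4, 1, 1], 3))
```

The table `best` is filled in O(k·n²) time, so every candidate breakpoint combination is covered without enumerating them explicitly.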

In scRNA-seq techniques, only a small fraction of the genes are captured due to "dropout" events. These dropout events require careful handling when analyzing scRNA-seq data. For example, imputation tools have been proposed to estimate dropout events and de-noise the data. The performance of these imputation tools is often evaluated, or fine-tuned, using clustering criteria based on ground-truth cell subgroup labels. This limits their effectiveness when cell subgroup knowledge is lacking. We consider an alternative strategy that requires the imputation to follow a "self-consistency" principle; that is, the imputation process refines its results until no internal inconsistencies or dropouts remain in the data. Here, we propose the use of "self-consistency" as the main criterion in performing imputation. To demonstrate this principle, we propose I-Impute, a "self-consistent" method for imputing scRNA-seq data. I-Impute optimizes continuous similarities and dropout probabilities in iterative refinements until a self-consistent imputation is reached. On the in silico datasets, I-Impute consistently exhibited the highest Pearson correlations across different dropout rates compared with the state-of-the-art methods SAVER and scImpute. Furthermore, we collected three wet-lab datasets (a mouse bladder cell dataset, an embryonic stem cell dataset, and an aortic leukocyte cell dataset) to evaluate the tools. I-Impute supported cell subpopulation discovery on all three datasets and achieved the highest clustering accuracy compared with other state-of-the-art imputation tools.
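The "refine until nothing changes" idea can be sketched as a fixed-point loop. This is not I-Impute's algorithm: treating every zero as a dropout, the Gaussian similarity kernel, and the names `impute`, `tol`, and `max_iter` are all assumptions made for illustration.

```python
import math


def impute(matrix, tol=1e-6, max_iter=100):
    """Fill zero entries with similarity-weighted averages until values stop changing.

    `matrix` is a list of cells, each a list of gene expression values.
    Hypothetical sketch: zeros are assumed to be dropouts, observed values are kept.
    """
    observed = [[v > 0 for v in row] for row in matrix]
    current = [[float(v) for v in row] for row in matrix]
    n_cells, n_genes = len(matrix), len(matrix[0])

    def similarity(a, b):
        # Gaussian kernel on squared Euclidean distance between two cell profiles.
        d2 = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-d2)

    for _ in range(max_iter):
        updated = [row[:] for row in current]
        for c in range(n_cells):
            # Similarities are recomputed from the current (partly imputed) profiles.
            weights = [
                similarity(current[c], current[o]) if o != c else 0.0
                for o in range(n_cells)
            ]
            total = sum(weights)
            if total == 0.0:
                continue  # no informative neighbours for this cell
            for g in range(n_genes):
                if not observed[c][g]:
                    updated[c][g] = (
                        sum(w * current[o][g] for o, w in enumerate(weights)) / total
                    )
        change = max(
            abs(u - v)
            for u_row, v_row in zip(updated, current)
            for u, v in zip(u_row, v_row)
        )
        current = updated
        if change < tol:
            break  # imputed values are consistent with the similarities they imply
    return current


# Toy matrix (made-up values): the zero in the first cell is a candidate dropout.
result = impute([[5, 3, 0], [5, 3, 4], [1, 1, 1]])
print(result[0])
```

The loop stops when one more pass would change no value by more than `tol`, which is one simple way to express self-consistency without reference to cell subgroup labels.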

RNA editing is a post-transcriptional molecular process that increases the diversity of the transcriptome by changing nucleotides in RNAs. With the advance of high-throughput sequencing techniques, numerous pipelines and software packages for identifying RNA editing sites have been developed in the last decade. Using these pipelines and software packages, millions of RNA editing sites have been identified across different species. However, analyzing these data remains a great challenge because tools for comprehensive analysis of RNA editing sites are lacking. Moreover, although the prevalence and importance of RNA editing have been illuminated in mammals, current RNA editing studies have concentrated principally on humans, other primates, and rodents. The profiling of RNA editing sites in many important farm animals, such as Sus scrofa, sheep, and cow, has not been reported.

To address the above problems in the RNA editing field, I developed a user-friendly public web tool named MIRIA that integrates statistics and visualization techniques to facilitate the comprehensive analysis of RNA editing sites identified by these pipelines and software packages. MIRIA is unique in that it provides several analytical functions, including RNA editing type statistics, genomic feature annotations, editing level statistics, genome-wide distribution of RNA editing sites, tissue-specific analysis, and conservation analysis. Furthermore, we conducted the first genome-wide investigation and functional analysis of Sus scrofa RNA editing sites across eleven tissues using MIRIA. We identified more than 490,000 Sus scrofa RNA editing sites and annotated their biological features, characterized the flanking sequences of A-to-I editing sites, assessed the impact of A-to-I editing on miRNA–mRNA interactions, and identified RNA editing quantitative trait loci (edQTLs). Sus scrofa RNA editing sites showed high enrichment in repetitive regions, with a median editing level of 15.38%. As expected, 96.3% of the editing sites were located in non-coding regions, including introns, 3′ UTRs, intergenic regions, and gene-proximal regions. There were 2233 editing sites located in coding regions, 980 of which caused missense mutations. Our results indicated that, for an A-to-I editing site, the four adjacent nucleotides (two before it and two after it) have a strong impact on whether editing occurs. A commonly observed editing motif is CCAGG. We found that 4552 A-to-I RNA editing sites could disturb the original binding efficiencies of miRNAs and that 4176 A-to-I RNA editing sites created new potential miRNA target sites. In addition, we performed edQTL analysis and found 1134 edQTLs that significantly affected the editing levels of 137 RNA editing sites. Finally, we constructed PRESDB, the first pig RNA editing sites database. The site provides the functions needed for studying Sus scrofa RNA editing.
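Two of the statistics mentioned above, editing type counts and the median editing level, can be computed as in the following minimal sketch. It does not reproduce MIRIA's code or input format: the record fields (`ref`, `alt`, `ref_reads`, `alt_reads`), the function name `summarize`, and the toy data are assumptions. The editing level of a site is taken here as the fraction of reads carrying the edited base; A-to-I editing appears as an A-to-G change in sequencing reads.

```python
from collections import Counter
from statistics import median


def summarize(sites):
    """Return (editing type counts, median editing level) for a list of sites.

    Each site is a dict with hypothetical keys: ref, alt, ref_reads, alt_reads.
    Sites with no read coverage are skipped when computing editing levels.
    """
    types = Counter(f"{s['ref']}-to-{s['alt']}" for s in sites)
    levels = [
        s["alt_reads"] / (s["ref_reads"] + s["alt_reads"])
        for s in sites
        if s["ref_reads"] + s["alt_reads"] > 0
    ]
    return types, median(levels)


# Made-up example records, for illustration only.
toy_sites = [
    {"ref": "A", "alt": "G", "ref_reads": 17, "alt_reads": 3},
    {"ref": "A", "alt": "G", "ref_reads": 15, "alt_reads": 5},
    {"ref": "C", "alt": "T", "ref_reads": 9, "alt_reads": 1},
]
type_counts, median_level = summarize(toy_sites)
print(type_counts, median_level)
```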