Bioinformatics Analysis and Applications for Developmental Transcriptome Data


Student thesis: Doctoral Thesis





Award date: 30 Jan 2020


Developmental biology is the field of biology that studies the processes by which multicellular organisms grow and mature under the control of their genes. Knowledge of normal developmental processes can aid in the understanding of developmental abnormalities and other conditions such as cancer. The field of developmental biology has more than 500 years of history. There are two fundamental questions in developmental biology: How does the fertilized egg give rise to the adult body, and how does that adult body produce yet another body? In nearly all cases, the development of a multicellular organism begins with a single cell, the fertilized egg or zygote, which divides mitotically to produce all the cells of the body. The study of animal development has traditionally been called embryology, after the stage of an organism that exists between fertilization and birth. But development does not stop at birth, or even at adulthood. Most organisms never stop developing. Each day we replace more than a gram of skin cells (the older cells being sloughed off as we move), and our bone marrow sustains the development of millions of new red blood cells every minute of our lives.

Thanks to modern techniques in molecular biology, genetics, and experimental embryology, our understanding of developmental mechanisms has grown exponentially. More and more genes (mRNAs, miRNAs, or lncRNAs) have been revealed to be important regulators of animal developmental processes. However, one obvious disadvantage of experimental molecular biology is the long cycle needed to find an important gene. Next-generation sequencing (NGS) technology provides an opportunity to study the expression changes of thousands of genes during different developmental processes, such as differentiation, morphogenesis, and reproduction. Besides gene expression levels, transcriptome data can also provide other information, including single nucleotide variations (SNVs), alternative splicing, and the editing levels of RNA editing sites. These transcriptomic and epitranscriptomic signals are important features for understanding developmental questions. However, the raw NGS data are only sequences, so a series of software tools or packages is needed to extract useful features such as gene expression levels, alternative splicing, and SNVs. To automate this process, we developed ten modules (one for quality control, two for read alignment, two for gene expression estimation, two for differential gene expression analysis, one for alternative splicing analysis, one for allele-specific expression analysis, and one for GO and KEGG enrichment analysis) and two pipelines (one for mRNA and one for lncRNA) that take in the raw NGS data and output the features we need.

We also tested the lncRNA pipeline using RNA sequencing data from the ovary tissue of Yorkshire and Meishan pigs. Using our pipeline, we obtained 3,827 lncRNAs expressed from 4,973 transcripts in swine ovary tissue, of which 510 were ovary-specific, 192 were differentially expressed between Yorkshire and Meishan pigs, and 38 were both ovary-specific and differentially expressed. We also predicted the functions of the lncRNAs by analysing their nearest neighbouring protein-coding genes. Together, our results provide a series of ovary-associated lncRNAs for further experimental investigation of their functions in ovary development, or for the selection of candidate genes for breeding. Furthermore, the alternative splicing and allele-specific expression analysis modules were tested using transcriptome data of skeletal muscle across 27 prenatal and postnatal stages in Landrace (L, higher muscle mass), Tongcheng (T, lower muscle mass), and hybrid (LT) pigs.
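
The neighbour-based function prediction mentioned above relies on finding, for each lncRNA, the closest protein-coding gene on the same chromosome. The following is a minimal sketch of that lookup, not the thesis implementation: the function name, the coordinate layout, and the example gene names are all hypothetical, and strand and annotation details are ignored.

```python
def nearest_protein_coding_gene(chrom, start, end, genes_by_chrom):
    """Return (gene_name, distance) for the closest gene on `chrom`, or None.

    `genes_by_chrom` is an assumed layout: chromosome -> list of
    (gene_start, gene_end, gene_name) tuples. Distance is measured between
    the nearest edges of the two features and is 0 when they overlap.
    """
    best = None
    for gene_start, gene_end, gene_name in genes_by_chrom.get(chrom, []):
        if gene_end < start:        # gene lies entirely upstream of the lncRNA
            distance = start - gene_end
        elif gene_start > end:      # gene lies entirely downstream of the lncRNA
            distance = gene_start - end
        else:                       # the two intervals overlap
            distance = 0
        if best is None or distance < best[1]:
            best = (gene_name, distance)
    return best


# Hypothetical coordinates, for illustration only.
example_genes = {"chr1": [(100, 200, "GENE_A"), (1000, 1200, "GENE_B")]}
print(nearest_protein_coding_gene("chr1", 300, 400, example_genes))
```

A linear scan is used here for readability; for a genome-wide annotation, a sorted index or an interval tree would be the more usual choice.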

Because developmental processes are dynamic, time-series expression experiments are widely used to study development and to find genes that may play important roles at different developmental stages. Recently, various algorithms designed specifically for time-series experiments have been developed, giving researchers the opportunity to solve problems that are unique to time-series expression data. In addition to the features extracted using these methods, we analysed a time-series dataset of swine skeletal muscle development across 27 time points, examining the rate of gene expression change during development, time-specific differentially expressed genes between two time-series datasets, specific developmental trajectories during different periods, development-related splicing changes, and constitutively expressed genes. Using these methods, we found that alternative splicing is essentially universal in skeletal muscle and that splicing changes following discrete patterns may play important roles in skeletal muscle development; that gene expression changes fastest prenatally and slows gently after birth; and that 5 of the 27 time points tended to have more differentially expressed genes related to muscle development.
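
One simple way to quantify how fast expression changes along a time series is the mean absolute log2 fold change per unit time between adjacent time points. The sketch below illustrates that idea only. The abstract does not state which measure was used, so the formula, the pseudocount, the function name, and the toy values are all assumptions.

```python
import math


def expression_change_rate(times, expr_by_gene, pseudocount=1.0):
    """Mean absolute log2 fold change per unit time for each adjacent interval.

    `times` is an increasing list of sampling times; `expr_by_gene` maps a gene
    name to its expression values, one per time point. Both layouts are
    assumptions made for this illustration.
    """
    if not expr_by_gene:
        return []
    rates = []
    for i in range(len(times) - 1):
        interval = times[i + 1] - times[i]
        total = 0.0
        for values in expr_by_gene.values():
            before = math.log2(values[i] + pseudocount)
            after = math.log2(values[i + 1] + pseudocount)
            total += abs(after - before)
        # Average over genes, then normalise by the length of the interval.
        rates.append(total / len(expr_by_gene) / interval)
    return rates


# Toy values, not data from the thesis.
toy_times = [0, 1, 3]
toy_expr = {"gene1": [1, 7, 7], "gene2": [0, 0, 3]}
print(expression_change_rate(toy_times, toy_expr))  # [1.0, 0.5]
```

Dividing by the interval length matters when sampling is uneven, as is common when prenatal and postnatal stages are spaced differently.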

Single-cell RNA-seq (scRNA-seq) measures the distribution of expression levels for each gene across a population of cells and represents cell-to-cell variation more accurately than a population average. In contrast, bulk RNA sequencing typically uses hundreds to millions of cells and reveals only the average expression level of each gene across a large population of cells. Therefore, scRNA-seq is particularly well suited to developmental biology. To extract features from scRNA-seq data, we constructed 12 modules (one for quality control, two for read alignment, two for feature quantification, two for quality calculation, one for expression imputation, one for normalization, one for differential expression analysis, one for subpopulation detection, and one for pseudo-time construction) and 2 pipelines (one for samples with fewer than 100 cells and one for samples with more than 100 cells) that take in the raw data and automatically output useful information on the features. We also constructed a database for investigating single-cell gene expression profiles along different developmental pathways (SCDevDB). For this database, we collected 10 human single-cell RNA-seq datasets, split them into 176 developmental cell groups, and constructed 24 different developmental pathways. SCDevDB allows users to search the expression profiles of genes of interest across different developmental pathways. It also provides lists of differentially expressed genes for each developmental pathway, t-SNE maps showing the relationships between developmental stages based on these differentially expressed genes, and GO and KEGG analysis results for these genes. This database is freely available at