Bioinformatic Approaches for Problems on Genetic Diversity

Student thesis: Doctoral Thesis

Detail(s)

Award date: 20 Aug 2018

Abstract

With advances in sequencing technology, numerous de novo sequencing projects and population studies have been conducted. The resulting abundance of sequencing data has spurred the development of computational and statistical methods in a wide range of areas, including de novo assembly, gene annotation, population analysis and forensic science. Genome diversity within an individual and across a population carries vital information for understanding the biological features of a species and its evolutionary history. In my Ph.D. study, I develop algorithms and statistical models for understanding and utilizing genome diversity in two respects: (1) the variation of alleles (heterozygosity) within an individual; and (2) the genetic variation among individuals in a population.

Heterozygosity is a double-edged sword for biological studies and bioinformatic methods. It offers insights into the evolution of a species, but it has also long plagued genome assembly. To address the drawback of heterozygosity, we introduce Hank, a method for heterozygosity assimilation through normalized k-mers. Hank eliminates heterozygosity from reads by modifying the k-mers in heterozygous regions so that their coverage rises to the level of k-mers in homozygous regions. To exploit the benefit of heterozygosity, we used it to study the karyotype of a segmental allohexaploid genome, Carassius auratus gibelio. By analyzing the k-mer spectrum of Carassius auratus gibelio, we estimated the karyotype and the similarity between homologous chromosomes as well as between non-homologous chromosomes.
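
As a rough illustration of what a k-mer spectrum is, the following minimal Python sketch counts k-mers in a handful of toy reads and tabulates how many distinct k-mers occur at each coverage. It is not the Hank implementation: the function names `count_kmers` and `kmer_spectrum` and the toy reads are assumptions made for this example, and reverse complements and sequencing errors are ignored. In a heterozygous genome, k-mers covering a heterozygous site would be expected to appear at roughly half the coverage of k-mers from homozygous regions.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(counts):
    """Map each coverage value to the number of distinct k-mers seen that often."""
    return Counter(counts.values())

# Toy reads (hypothetical): two reads carry one allele, one read carries the other.
reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, k=3)
spectrum = kmer_spectrum(counts)

for coverage in sorted(spectrum):
    print(coverage, spectrum[coverage])
```

In this toy input the k-mer `ACG` is shared by both alleles and therefore has the highest count, while the k-mers spanning the variant position split into two lower-coverage groups.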

For genetic variation among populations, we developed three software packages: one for scaffolding by linkage disequilibrium, one for paternity testing with Next Generation Sequencing (NGS), and one for pedigree construction from databases.

Linkage disequilibrium is the non-random association of alleles at different loci in a population. It decays with the physical distance between a pair of loci on the genome, within a 550-kb region. We propose LDScaf for draft genome scaffolding from population data. LDScaf first constructs a complete graph in which the vertices are the scaffold sides and the edge weights are the linkage values between them. It then solves the matching problem on this graph to obtain a set of edges between scaffolds that do not share common vertices. Together with the edges that connect the vertices within each scaffold, these edges form cycles. We then remove the edges with the weakest linkage values to transform the cycles into linear permutations that indicate the order and orientation of the scaffolds on the target genome.

Paternity testing has changed greatly over the last three decades, driven by improvements in DNA sequencing technology. The method most widely adopted in forensic laboratories worldwide is PCR- and CE-based sequencing that detects fragment length variation in the 13 core short tandem repeat (STR) markers published by the FBI. Because the sequencing technique restricts the number of markers, this method is vulnerable to false exclusions caused by allelic dropout, null alleles, contamination, human error and mutations in offspring. Next Generation Sequencing, with its high throughput and low cost, provides a feasible solution. We propose a novel Bayesian statistical method for paternity testing with NGS. The method takes advantage of both SNP markers and STR markers along the whole human genome.

Pedigrees serve as vital information in forensic applications. With the increasing number of entries in forensic DNA databases worldwide, constructing pedigrees from these databases has become a problem. For this purpose, we propose a method for pedigree search in databases based on belief propagation. Our method converts the pedigree to a factor graph and calculates the genotype likelihoods of the unknown individuals. We iteratively search the database and update the factor graph until every individual is either marked as known or cannot be found in the database.
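
The LDScaf scaffolding procedure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the LDScaf implementation: a greedy heuristic stands in for the matching step, the linkage values are invented toy numbers, and all names (`greedy_matching`, `break_cycles`, `linearize`, the `"head"`/`"tail"` labels) are hypothetical.

```python
def other(end):
    """Return the opposite end of the same scaffold."""
    scaffold, side = end
    return (scaffold, "tail" if side == "head" else "head")

def greedy_matching(linkage):
    """Greedy stand-in for the matching step: take the strongest links first,
    using each scaffold end at most once."""
    matched = {}
    for (a, b), score in sorted(linkage.items(), key=lambda kv: -kv[1]):
        if a[0] == b[0] or a in matched or b in matched:
            continue
        matched[a] = (b, score)
        matched[b] = (a, score)
    return matched

def break_cycles(matched):
    """Remove the weakest link of every cycle so that only linear chains remain."""
    matched = dict(matched)
    seen = set()
    for start in list(matched):
        if start in seen or start not in matched:
            continue
        links, cur, is_cycle = [], start, False
        while cur in matched:
            partner, score = matched[cur]
            links.append((score, cur, partner))
            seen.update((cur, partner))
            cur = other(partner)  # cross the partner scaffold to its far end
            if cur == start:
                is_cycle = True
                break
        if is_cycle:
            _, a, b = min(links, key=lambda link: link[0])
            del matched[a], matched[b]
    return matched

def linearize(scaffolds, matched):
    """Walk each chain from a free end, recording order and orientation."""
    paths, used = [], set()
    for s in scaffolds:
        for side in ("head", "tail"):
            cur = (s, side)
            if s in used or cur in matched:
                continue
            path = []
            while True:
                name, entry_side = cur
                path.append((name, "+" if entry_side == "head" else "-"))
                used.add(name)
                exit_end = other(cur)
                if exit_end not in matched:
                    break
                cur = matched[exit_end][0]
            paths.append(path)
    return paths

# Toy linkage values between scaffold ends (hypothetical numbers).
linkage = {
    (("A", "tail"), ("B", "head")): 0.9,
    (("B", "tail"), ("C", "head")): 0.8,
    (("C", "tail"), ("A", "head")): 0.3,  # weak link that closes a cycle
    (("A", "tail"), ("C", "head")): 0.1,
}
matching = break_cycles(greedy_matching(linkage))
paths = linearize(["A", "B", "C"], matching)
print(paths)
```

A greedy matching is not guaranteed to be optimal; it is used here only to keep the sketch free of external dependencies. An exact maximum-weight matching could be substituted without changing the cycle-breaking and linearization steps.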