Phylogenetic tree reconciliation, haplotype assembly and blocked pattern matching

進化樹和解與單體型組裝及分塊模式匹配

Student thesis: Doctoral Thesis

View graph of relations

Author(s)

  • Fei DENG

Related Research Unit(s)

Detail(s)

Awarding Institution
Supervisors/Advisors
Award date3 Oct 2014

Abstract

In this thesis, we study several important problems arising in computational biology and bioinformatics. These problems are phylogenetic tree reconciliation, haplotype assembly and blocked pattern matching. Phylogenetic trees have long been used to describe the evolutionary history of a set of species. Generally, the phylogenetic tree built from a single gene family (referred to as the gene tree) may disagree with the phylogenetic tree of the underlying species (referred to as the species tree) in the presence of evolutionary events, such as gene duplications, gene losses, and LGTs. Tree reconciliation that maps the vertices of the gene tree to the vertices of the species tree is a commonly used model for explaining the discrepancies between a gene tree and a species tree. When computing a reconciliation between a gene tree and a species tree, most of the previous studies consider only a proper subset of the three events: gene duplications, gene losses, and LGTs. In this thesis, we propose a fixed-parameter algorithm for the problem of enumerating all minimum-cost LCA-reconciliations involving all the three events. Moreover, our algorithm can work for the weighted version of the problem, where the costs of a gene duplication, a gene loss, and an LGT are left to the user's discretion. For the case where only gene duplications and lateral gene transfers (LGTs) are considered, the previous best algorithm for enumerating all minimum-cost LCA-reconciliations runs in O(m+3kn) time, where m is the number of vertices in S, n is the number of vertices in G, and k is the minimum cost of an LCA-reconciliation between S and G. In this thesis, we also consider the same problem and propose an improved algorithm that runs in O(mn2+3k) time. Haplotypes play a crucial role in genetic analysis and have many applications such as gene disease diagnoses, association studies, ancestry inference, etc. With the development of DNA sequencing technologies, it is possible to obtain haplotypes from a set of aligned reads originated from both copies of a chromosome of a single individual. This approach is often known as haplotype assembly. Exact algorithms that can give optimal solutions to the haplotype assembly problem are highly demanded. Unfortunately, previous algorithms for this problem either fail to output optimal solutions or take too long time even executed on a PC cluster. In this thesis, we develop an approach to finding optimal solutions for the haplotype assembly problem under the minimum-error-correction (MEC) model. Most of the previous approaches assume that the columns in the input matrix correspond to (putative) heterozygous sites. This all-heterozygous assumption is correct for most columns, but it may be incorrect for a small number of columns. In this thesis, we consider the MEC model with or without the all-heterozygous assumption. In our approach, we first use new methods to decompose the input read matrix into small independent blocks and then model the problem for each block as an integer linear programming (ILP) problem. Tandem mass spectrometry has become the method of choice for protein identification and quantification. In the era of big data biology, tandem mass spectra are often searched against very large protein databases generated from genomes or RNA-Seq data for peptide identification. However, most existing tools for mass spectrometrybased peptide identification compute the similarity scores between every tandem mass spectrum and many peptides in the database, making spectral data analysis slow for very large databases. Tag-based methods use a set of tags extracted from a tandem mass spectrum as a filter to reduce the number of candidate peptides, thus speeding up the database search. Recently, blocked patterns (gapped tags) have been introduced because they can improve the sensitivity of peptide identification by database search compared with ordinary tags. However, the blocked pattern matching problem, which is an essential step in blocked pattern-based peptide identification, has not been fully solved. In this thesis, we propose a fast and memory-efficient algorithm for the blocked pattern matching problem. Experiments on both simulated and real data sets showed that the proposed algorithm achieved both high speed and sensitivity for peptide filtration in peptide identification by database search.

    Research areas

  • Phylogeny, Single Nucleotide Polymorphism, Computational biology, Pattern recognition systems