Phylogenetic tree reconciliation, haplotype assembly and blocked pattern matching
進化樹和解與單體型組裝及分塊模式匹配
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors |
|
Award date | 3 Oct 2014 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(02539b7f-d310-496b-9727-65452569b6a5).html |
---|---|
Other link(s) | Links |
Abstract
In this thesis, we study several important problems arising in computational biology
and bioinformatics. These problems are phylogenetic tree reconciliation, haplotype
assembly and blocked pattern matching.
Phylogenetic trees have long been used to describe the evolutionary history of a set
of species. Generally, the phylogenetic tree built from a single gene family (referred
to as the gene tree) may disagree with the phylogenetic tree of the underlying species
(referred to as the species tree) in the presence of evolutionary events, such as gene
duplications, gene losses, and LGTs. Tree reconciliation that maps the vertices of the
gene tree to the vertices of the species tree is a commonly used model for explaining
the discrepancies between a gene tree and a species tree. When computing a reconciliation
between a gene tree and a species tree, most of the previous studies consider
only a proper subset of the three events: gene duplications, gene losses, and LGTs. In
this thesis, we propose a fixed-parameter algorithm for the problem of enumerating all
minimum-cost LCA-reconciliations involving all the three events. Moreover, our algorithm
can work for the weighted version of the problem, where the costs of a gene
duplication, a gene loss, and an LGT are left to the user's discretion.
For the case where only gene duplications and lateral gene transfers (LGTs) are considered,
the previous best algorithm for enumerating all minimum-cost LCA-reconciliations
runs in O(m+3kn) time, where m is the number of vertices in S, n is the number of vertices
in G, and k is the minimum cost of an LCA-reconciliation between S and G. In
this thesis, we also consider the same problem and propose an improved algorithm that runs in O(mn2+3k) time.
Haplotypes play a crucial role in genetic analysis and have many applications such
as gene disease diagnoses, association studies, ancestry inference, etc. With the development
of DNA sequencing technologies, it is possible to obtain haplotypes from
a set of aligned reads originated from both copies of a chromosome of a single individual.
This approach is often known as haplotype assembly. Exact algorithms that
can give optimal solutions to the haplotype assembly problem are highly demanded.
Unfortunately, previous algorithms for this problem either fail to output optimal solutions
or take too long time even executed on a PC cluster. In this thesis, we develop
an approach to finding optimal solutions for the haplotype assembly problem under
the minimum-error-correction (MEC) model. Most of the previous approaches assume
that the columns in the input matrix correspond to (putative) heterozygous sites. This
all-heterozygous assumption is correct for most columns, but it may be incorrect for a
small number of columns. In this thesis, we consider the MEC model with or without
the all-heterozygous assumption. In our approach, we first use new methods to decompose
the input read matrix into small independent blocks and then model the problem
for each block as an integer linear programming (ILP) problem.
Tandem mass spectrometry has become the method of choice for protein identification
and quantification. In the era of big data biology, tandem mass spectra are often
searched against very large protein databases generated from genomes or RNA-Seq
data for peptide identification. However, most existing tools for mass spectrometrybased
peptide identification compute the similarity scores between every tandem mass
spectrum and many peptides in the database, making spectral data analysis slow for
very large databases. Tag-based methods use a set of tags extracted from a tandem
mass spectrum as a filter to reduce the number of candidate peptides, thus speeding
up the database search. Recently, blocked patterns (gapped tags) have been introduced
because they can improve the sensitivity of peptide identification by database search
compared with ordinary tags. However, the blocked pattern matching problem, which is an essential step in blocked pattern-based peptide identification, has not been fully
solved. In this thesis, we propose a fast and memory-efficient algorithm for the blocked
pattern matching problem. Experiments on both simulated and real data sets showed
that the proposed algorithm achieved both high speed and sensitivity for peptide filtration
in peptide identification by database search.
- Phylogeny, Single Nucleotide Polymorphism, Computational biology, Pattern recognition systems