Mutation region detection and haplotype assembly
基因突變區域檢測及單體型組裝
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors |
|
Award date | 2 Oct 2013 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(209dd855-46a8-4fb5-a7a1-840b134c3857).html |
---|---|
Other link(s) | Links |
Abstract
In this thesis, we study two important problems in computational biology and bioinformatics,
mutation region detection and haplotype assembly.
Linkage analysis is the first step in the search for a disease gene. Linkage studies
have facilitated the identification of several hundred human genes that can harbor mutations
leading to a disease phenotype. In Chapter 2, we study a very important case,
where the sampled individuals are closely related, but the pedigree is not given. This
situation happens very often when the individuals share a common ancestor 6 or more
generations ago. To our knowledge, no algorithm can give good results for this case.
To solve this problem, we first developed some heuristic algorithms for haplotype inference
without any given pedigree. We propose a model using the parsimony principle
that can be viewed as an extension of the model first proposed by Dan Gusfield. Our
heuristic algorithm uses Clark's inference rule to infer haplotype segments. We ran our
program both on the simulated data and a set of real data from the phase II HapMap
database. Experiments show that our program performs well. The recall value is from
90% to 99% in various cases. This implies that the program can report more than 90%
of the true mutation regions. The value of precision varies from 29% to 90%. When the
precision is 29%, the size of the reported regions is three times that of the true mutation
region. This is still very useful for narrowing down the range of the disease gene location.
Our program can complete the computation for all the tested cases, where there
are about 110,000 SNPs on a chromosome, within 20 seconds.
For mutation region detection problem, there is another version where the pedigree
is not given and a database of confirmed haplotypes is given as reference instead. To
infer the allele-sharing status of a set of individuals with the database of confirmed haplotypes, the shared center(SC) problem was proposed. The closest shared center
(CSC) problem, which is the minimization version of the SC problem, is a core to solve
the mutation region detection problem in this case. In chapter 3, we design a polynomial
time approximation scheme for the CSC problem, i.e., for any fixed number ϵ ∈(0, 1),
our algorithm can find a feasible solution with objective value no more than (1+ϵ)d*,
where d* is the optimal objective value of the input instance for the CSC problem.
Single nucleotide polymorphisms (SNPs) are the most common form of genetic
variation in human DNA. The sequence of SNPs in each of the two copies of a given
chromosome in a diploid organism is referred to as a haplotype. Haplotype information
has many applications such as gene disease diagnoses, drug design, etc. The haplotype
assembly problem is defined as follows: Given a set of fragments sequenced from
the two copies of a chromosome of a single individual, and their locations in the chromosome,
which can be pre-determined by aligning the fragments to a reference DNA
sequence, the goal here is to reconstruct two haplotypes (h1; h2) from the input fragments.
Existing algorithms do not work well when the error rate of fragments is high.
In chapter 4, we design an algorithm that can give accurate solutions, even if the error
rate of fragments is high. We first give a dynamic programming algorithm that can give
exact solutions to the haplotype assembly problem. The time complexity of the algorithm
is O(n x 2t x t), where n is the number of SNPs, and t is the maximum coverage
of a SNP site. The algorithm is slow when t is large. To solve the problem when t is
large, we further propose a heuristic algorithm on the basis of the dynamic programming
algorithm. Experiments show that our heuristic algorithm can give very accurate
solutions. We have tested our algorithm on a set of benchmark datasets. Experiments
show that our algorithm can give very accurate solutions. It outperforms most of the
existing programs when the error rate of the input fragments is high.
- Mutation (Biology), Bioinformatics, Computational biology, Single nucleotide polymorphisms, Mathematical models