Instable Region Identification, Better Practical Algorithms for rSPR Distance and Hybridization Number and Finding a Center Tree of Phylogenetic Trees via Leaf Removal
非穩定基因片段的分析研究,計算rSPR距離和雜交數量的實用性算法及基於剪除葉子的尋找進化樹中心樹算法
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors |
|
Award date | 29 Jul 2019 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(c2bab114-81d5-43b5-97f9-0afd8f7bfd29).html |
---|---|
Other link(s) | Links |
Abstract
In this thesis, we study three problems in bioinformatics and computational biology. The first problem is to identify instable regions from bacterial strains. The second problem is to design better practical algorithms for rSPR distance and hybridization number. The third problem is to design fixed parameter algorithms for finding a center tree with minimum total number of leaf removes and finding a center with at most d leaf removes from each of the n input trees.
Pseudomonas aeruginosa is a common bacterium which is recognized for its association with hospital-acquired infections and its advanced antibiotic resistance mechanisms. Tuberculosis, one of the major causes of mortality, is initiated by the deposition of Mycobacterium tuberculosis. Accessory sequences shared by a subset of strains of a species play an important role in a species’ evolution, antibiotic resistance and infectious potential. In this thesis, with a multiple sequence aligner, we segmented 25 Pseudomonas aeruginosa genomes and 28 Mycobacterium tuberculosis genomes into core blocks (include sequences shared by all the input genomes) and dispensable blocks (include sequences shared by a subset of the input genomes), respectively.For each input genome, we then constructed a scaffold consisting of its core and dispensable blocks sorted by blocks’ locations on the chromosomes. Consecutive dispensable blocks on these scaffold formed instable regions. After a comprehensive study of these instable regions, three characteristics of instable regions are summarized: instable regions were short, site specific and varied in different strains. Three DNA elements (directed repeats (DRs), transposons and integrons) were then studied to see whether these DNA elements are associated with the variation of instable regions.
A pipeline was developed to search for DR pairs on the flank of every instable sequence. 27 DR pairs in Pseudomonas aeruginosa strains and 6 pairs in Mycobacterium tuberculosis strains were found to exist in the instable regions. On the average, 14% and 12% of instable regions in Pseudomonas aeruginosa strains covered transposase genes and integrase genes, respectively. In Mycobacterium tuberculosis strains, an average of 43% and 8% of instable regions contain transposase genes and integrase genes, respectively.
Instable regions were short, site specific and varied in different strains for both Pseudomonas aeruginosa and Mycobacterium tuberculosis. Our experimental results showed that directed repeats (DRs), transposons and integrons may be associated with variation of instable regions.
The problem of computing the rSPR distance of two phylogenetic trees is NP-hard and so is the problem of computing the hybridization number of two phylogenetic trees (denoted by HNC). Since they are important problems in phynogenetics, they have been studied extensively in the literature. Indeed, quite a number of exact or approximation algorithms have been designed and implemented for them. In this thesis, we design and implement several approximation algorithms for them and one exact algorithm for HNC. Our experimental results show that the resulting exact program is much faster (namely, more than 80 times faster for the easiest dataset used in the experiments) than the previous best and its superiority in speed becomes even more significant for more difficult instances. Moreover, the resulting approximation programs output much better results than the previous bests; indeed, the outputs are always nearly optimal and often optimal. Of particular interest is the usage of the Monte Carlo tree search (MCTS) method in the design of our approximation algorithms. Our experimental results show that with MCTS, we can often solve HNC exactly within short time.
The problem of computing a center tree from a set of input species phylogenetic trees is also interesting. Here we study the following version:
Given a set T = {T1,T2, ...,Tm} of phylogenetic trees with the same leaf-label set X, we wish to remove some leaves from the trees so that there is a tree T with leaf-label set X displaying all the resulting trees. Note that different leaves may be removed from different input trees. One objective is to minimize the total number of leaves removed from the trees, while the other is to minimize the maximum number of leaves removed from an input tree. Chauve et al. refer to the problem with the first (respectively, second) objective as AST-LR (respectively, AST-LR-d), and show that both problems are NP-hard. They further present algorithms for the parameterized versions of both problems. In this thesis, we point out that their algorithm for the parameterized version of AST-LR is flawed and present a new algorithm. We then show that their algorithm for the parameterized version of AST-LR-d can be sped up by an exponential factor. We further design integer-linear programming (ILP for short) models for AST-LR and AST-LR-d, and discuss speedup issues when using popular ILP solvers (say, GUROBI or CPLEX) to solve the models.
Our experimental results show that our ILP approach is quite efficient.
Pseudomonas aeruginosa is a common bacterium which is recognized for its association with hospital-acquired infections and its advanced antibiotic resistance mechanisms. Tuberculosis, one of the major causes of mortality, is initiated by the deposition of Mycobacterium tuberculosis. Accessory sequences shared by a subset of strains of a species play an important role in a species’ evolution, antibiotic resistance and infectious potential. In this thesis, with a multiple sequence aligner, we segmented 25 Pseudomonas aeruginosa genomes and 28 Mycobacterium tuberculosis genomes into core blocks (include sequences shared by all the input genomes) and dispensable blocks (include sequences shared by a subset of the input genomes), respectively.For each input genome, we then constructed a scaffold consisting of its core and dispensable blocks sorted by blocks’ locations on the chromosomes. Consecutive dispensable blocks on these scaffold formed instable regions. After a comprehensive study of these instable regions, three characteristics of instable regions are summarized: instable regions were short, site specific and varied in different strains. Three DNA elements (directed repeats (DRs), transposons and integrons) were then studied to see whether these DNA elements are associated with the variation of instable regions.
A pipeline was developed to search for DR pairs on the flank of every instable sequence. 27 DR pairs in Pseudomonas aeruginosa strains and 6 pairs in Mycobacterium tuberculosis strains were found to exist in the instable regions. On the average, 14% and 12% of instable regions in Pseudomonas aeruginosa strains covered transposase genes and integrase genes, respectively. In Mycobacterium tuberculosis strains, an average of 43% and 8% of instable regions contain transposase genes and integrase genes, respectively.
Instable regions were short, site specific and varied in different strains for both Pseudomonas aeruginosa and Mycobacterium tuberculosis. Our experimental results showed that directed repeats (DRs), transposons and integrons may be associated with variation of instable regions.
The problem of computing the rSPR distance of two phylogenetic trees is NP-hard and so is the problem of computing the hybridization number of two phylogenetic trees (denoted by HNC). Since they are important problems in phynogenetics, they have been studied extensively in the literature. Indeed, quite a number of exact or approximation algorithms have been designed and implemented for them. In this thesis, we design and implement several approximation algorithms for them and one exact algorithm for HNC. Our experimental results show that the resulting exact program is much faster (namely, more than 80 times faster for the easiest dataset used in the experiments) than the previous best and its superiority in speed becomes even more significant for more difficult instances. Moreover, the resulting approximation programs output much better results than the previous bests; indeed, the outputs are always nearly optimal and often optimal. Of particular interest is the usage of the Monte Carlo tree search (MCTS) method in the design of our approximation algorithms. Our experimental results show that with MCTS, we can often solve HNC exactly within short time.
The problem of computing a center tree from a set of input species phylogenetic trees is also interesting. Here we study the following version:
Given a set T = {T1,T2, ...,Tm} of phylogenetic trees with the same leaf-label set X, we wish to remove some leaves from the trees so that there is a tree T with leaf-label set X displaying all the resulting trees. Note that different leaves may be removed from different input trees. One objective is to minimize the total number of leaves removed from the trees, while the other is to minimize the maximum number of leaves removed from an input tree. Chauve et al. refer to the problem with the first (respectively, second) objective as AST-LR (respectively, AST-LR-d), and show that both problems are NP-hard. They further present algorithms for the parameterized versions of both problems. In this thesis, we point out that their algorithm for the parameterized version of AST-LR is flawed and present a new algorithm. We then show that their algorithm for the parameterized version of AST-LR-d can be sped up by an exponential factor. We further design integer-linear programming (ILP for short) models for AST-LR and AST-LR-d, and discuss speedup issues when using popular ILP solvers (say, GUROBI or CPLEX) to solve the models.
Our experimental results show that our ILP approach is quite efficient.