Pseudo-periodic repeats, bi-clustering algorithms and quasi-bicliques
擬重複序列, 雙聚類算法和準完全二分圖
Student thesis: Doctoral Thesis
Author(s)
Related Research Unit(s)
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors |
|
Award date | 15 Feb 2008 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(4837339a-dbc5-4ef0-a858-e32f1d5fe3a3).html |
---|---|
Other link(s) | Links |
Abstract
In this thesis, we study several important problems in computation biology and
bioinformatics. These problems are the pseudo-periodic repeat discovery problem,
the bi-clustering problem and the quasi-biclique problem.
The genomes of many species are dominated by short sequences repeated consecutively.
It is estimated that over 10% of the human genome consists of tandemly
repeated sequences. Finding repeated regions in long sequences is important in
sequence analysis. In Chapter 2, we develop a software, LocRepeat, that finds
regions of pseudo-periodic repeats in a long sequence. We use the definition of the
pseudo-periodic partition of a region and design an algorithm that can select the
repeated region from a given long sequence and give the pseudo-periodic partition
of the region.
One of the main goals in the analysis of microarray data is to identify groups
of genes and groups of experimental conditions (including environments, individuals
and tissues) that exhibit similar expression patterns. This is the so-called
bi-clustering problem. In Chapter 3, we consider two variants of the bi-clustering
problem: the consensus sub-matrix problem and the bottleneck sub-matrix problem.
The input of the problems contains an m× n matrix A and integers l and k. The consensus sub-matrix problem is to find an l × k sub-matrix with l < m and
k < n and a consensus vector such that the sum of distances between the rows in
the sub-matrix and the consensus vector is minimized. The bottleneck sub-matrix
problem is to find an l × k sub-matrix with l < m and k < n, an integer d and
a center vector such that the distance between every row in the sub-matrix and
the vector is at most d and d is minimized. We show that both problems are NPhard
and give randomized approximation algorithms for special cases of the two
problems. Using standard techniques, we can derandomize the algorithms to get
polynomial time approximation schemes for the two problems. To our knowledge,
this is the first time that approximation algorithms with guaranteed ratios are
presented for the bi-clustering problem.
We have another strike on the bi-clustering problem in Chapter 4. We define
the maximum similarity score for a bi-cluster and design a polynomial time algorithm
to find an optimal bi-cluster with the maximum similarity score. To our
knowledge, this is the first formulation for bi-clustering problems that admits a
polynomial time exact algorithm. The algorithm works for a special case, where
the bi-clusters are approximately squares. We then extend the algorithm to handle
various kinds of other cases. Experiments on simulated data and real data show
that the new algorithms outperform most of the existing methods in many cases.
Our new algorithms have the following advantages: (1) no discretization procedure
is required, (2) performs well for overlapping bi-clusters, and (3) works well
for additive bi-clusters. Protein-protein interactions (PPIs) are one of the most important mechanisms
in cellular processes. To model protein interaction sites, recent studies have suggested
to find interacting protein group pairs from large PPI networks at the first
step, and then to search conserved motifs within the protein groups to form interacting
motif pairs. To consider noise effect and incompleteness of biological data,
we propose to use quasi-bicliques for finding interacting protein group pairs. In
Chapter 5, we investigate two new problems which arise from finding interacting
protein group pairs: the maximum vertex quasi-biclique problem and the maximum
balanced quasi-biclique problem. We prove that both problems are NP-hard.
This is a surprising result as the widely known maximum vertex biclique problem
is polynomial time solvable [75]. We then propose a heuristic algorithm which uses
the greedy method to find the quasi-bicliques from PPI networks. Our experimental
results on real data show that this algorithm has a better performance than a
benchmark algorithm for identifying highly matched BLOCKS and PRINTS motifs.
We also report results of two case studies on interacting motif pairs which
map well with two interacting domain pairs in iPfam.
- Computational biology, Bioinformatics