Pseudo-periodic repeats, bi-clustering algorithms and quasi-bicliques

擬重複序列, 雙聚類算法和準完全二分圖

Student thesis: Doctoral Thesis

View graph of relations

Author(s)

  • Xiaowen LIU

Related Research Unit(s)

Detail(s)

Awarding Institution
Supervisors/Advisors
Award date15 Feb 2008

Abstract

In this thesis, we study several important problems in computation biology and bioinformatics. These problems are the pseudo-periodic repeat discovery problem, the bi-clustering problem and the quasi-biclique problem. The genomes of many species are dominated by short sequences repeated consecutively. It is estimated that over 10% of the human genome consists of tandemly repeated sequences. Finding repeated regions in long sequences is important in sequence analysis. In Chapter 2, we develop a software, LocRepeat, that finds regions of pseudo-periodic repeats in a long sequence. We use the definition of the pseudo-periodic partition of a region and design an algorithm that can select the repeated region from a given long sequence and give the pseudo-periodic partition of the region. One of the main goals in the analysis of microarray data is to identify groups of genes and groups of experimental conditions (including environments, individuals and tissues) that exhibit similar expression patterns. This is the so-called bi-clustering problem. In Chapter 3, we consider two variants of the bi-clustering problem: the consensus sub-matrix problem and the bottleneck sub-matrix problem. The input of the problems contains an m× n matrix A and integers l and k. The consensus sub-matrix problem is to find an l × k sub-matrix with l < m and k < n and a consensus vector such that the sum of distances between the rows in the sub-matrix and the consensus vector is minimized. The bottleneck sub-matrix problem is to find an l × k sub-matrix with l < m and k < n, an integer d and a center vector such that the distance between every row in the sub-matrix and the vector is at most d and d is minimized. We show that both problems are NPhard and give randomized approximation algorithms for special cases of the two problems. Using standard techniques, we can derandomize the algorithms to get polynomial time approximation schemes for the two problems. To our knowledge, this is the first time that approximation algorithms with guaranteed ratios are presented for the bi-clustering problem. We have another strike on the bi-clustering problem in Chapter 4. We define the maximum similarity score for a bi-cluster and design a polynomial time algorithm to find an optimal bi-cluster with the maximum similarity score. To our knowledge, this is the first formulation for bi-clustering problems that admits a polynomial time exact algorithm. The algorithm works for a special case, where the bi-clusters are approximately squares. We then extend the algorithm to handle various kinds of other cases. Experiments on simulated data and real data show that the new algorithms outperform most of the existing methods in many cases. Our new algorithms have the following advantages: (1) no discretization procedure is required, (2) performs well for overlapping bi-clusters, and (3) works well for additive bi-clusters. Protein-protein interactions (PPIs) are one of the most important mechanisms in cellular processes. To model protein interaction sites, recent studies have suggested to find interacting protein group pairs from large PPI networks at the first step, and then to search conserved motifs within the protein groups to form interacting motif pairs. To consider noise effect and incompleteness of biological data, we propose to use quasi-bicliques for finding interacting protein group pairs. In Chapter 5, we investigate two new problems which arise from finding interacting protein group pairs: the maximum vertex quasi-biclique problem and the maximum balanced quasi-biclique problem. We prove that both problems are NP-hard. This is a surprising result as the widely known maximum vertex biclique problem is polynomial time solvable [75]. We then propose a heuristic algorithm which uses the greedy method to find the quasi-bicliques from PPI networks. Our experimental results on real data show that this algorithm has a better performance than a benchmark algorithm for identifying highly matched BLOCKS and PRINTS motifs. We also report results of two case studies on interacting motif pairs which map well with two interacting domain pairs in iPfam.

    Research areas

  • Computational biology, Bioinformatics