Skip to main navigation Skip to search Skip to main content

Extraction and matching of coherent patterns in biomolecular data

  • Zhiguan WANG

Student thesis: Doctoral Thesis

Abstract

With the fast development of molecular biology, large amounts of biomolecular data can be generated from the Human Genome Project, calling for the development of novel algorithms for data analysis and opening up new areas for applications of many pattern recognition techniques. In this thesis, we propose a number of new graph theory-based biclustering algorithms for coherent pattern extraction in microarray data. Besides, we also develop a design for a novel protein-protein docking algorithm with a probabilistic relaxation labeling scheme for intermolecular hydrogen bond matching. Biclustering is a non-supervised approach that performs simultaneous clustering on both rows and columns of a matrix. The resulting biclusters are sub-matrices of the original full data matrix that contain some special coherent structures. Biclustering has a wide range of applications in different areas such as bioinformatics, image retrieval and text file recognition. In geometric based biclustering, different bicluster patterns in a matrix can be seen as linear objects with special structures in high dimensional spaces. Therefore we can search for linear objects in a multi-dimensional space to find bicluster patterns in a matrix. To overcome the disadvantage of the high computational complexity of this method in multi-dimensional space, the sub-dimension based method is introduced for biclustering. The matrix with embedded bicluster patterns can be split into column pairs. We use a modified Hough transform in 2-D spaces for sub-bicluster extraction. The main contribution of the work undertaken in this part is that we utilize the graph spectra theory to speed up the sub-bicluster combination process. After the modified Hough transform for sub-bicluster detection, we build a hypergraph with the sub-biclusters representing nodes as well as the common elements between nodes representing hyperedges. The hypergraph partitioning algorithm based on the multilevel paradigm is utilized to cut the hypergraph so that the nodes in the same sub-hypergraph are able to share more hyperedges. As a result, the expansion process can be optimized through combining the sub-biclusters in each sub-graph. However, according to the hypergraph partitioning scheme, a proper value of the k part should be set in order to get the partitions. Besides, taking all the sub-biclusters into consideration for combining the sub-biclustering in each sub-graph may also contribute to the large computation complexity. To overcome the above shortcomings of the hypergraph based combination process, we build a graph with each sub-bicluster as a node instead. The irrelevant sub-bicluster can be filtered out by the edge pruning process. Through computing the eigenvectors of the graph adjacency matrix, the eigengroup representing the dense part of the graph can be obtained. Experiment results show that combining the sub-bicluster in each eigengroup can efficiently improve the biclustering speed. The biological experiments also show that the proposed algorithm can detect meaningful gene biclusters from the microarray data with good results. The biclustering technique which aims to extract coherent patterns in matrices offers an efficient way for large gene expression data analysis. In this work, we also study the pattern matching task and propose a probabilistic relaxation labeling based protein-protein docking method. Protein-protein docking is a molecular modeling technique that aims to predict the position and orientation of a ligand when it is bound to a protein receptor. The 3D structure of protein-protein complexes provides a valuable understanding of the function of molecular systems. Thus the protein-protein docking task has a lot of applications which may be related to inhibitor design, cellular pathways prediction, macromolecular interactions and macromolecular assemblies. In this work, we propose a new rigid protein-protein docking algorithm based on a relaxation labeling scheme for searching the intermolecular hydrogen bonds between receptors and ligands. In the matching process, the receptors and ligands can be seen as two sets of points composed of hydrogen bond donors and acceptors. Therefore, the protein-protein matching scheme can be transformed into a point matching task. The donors and acceptors on the receptors and ligands refer to two sets of potential matching points, and the relaxation labeling method can select the best matching pairs and reduce the ambiguity and noise effect. To estimate the initial probability, the patch histogram is introduced for local shape matching and we also consider the geometric constraint in order to obtain the conditional probability matrix. After several iterations, the assignment probabilities converge to a binary indicator vector indicating the possible matching pairs. Thus after the matching process, only a very small subset of hydrogen bonding modes is retained for ranking with the score function, which may reduce the computation time. The relaxation labeling based docking method is demonstrated to be efficient and accurate by the experiments carried out on the rigid protein-protein dataset.
Date of Award2 Oct 2013
Original languageEnglish
Awarding Institution
  • City University of Hong Kong
SupervisorHong YAN (Supervisor)

Keywords

  • Molecular biology
  • Data processing

Cite this

'