Skip to main navigation Skip to search Skip to main content

Regression Based Global Similarity Search Algorithms for Biological Networks

Student thesis: Doctoral Thesis

Abstract

High-throughput biotechniques have been applied to generate a significant amount of biological networks in the post-genomics era. Biological network analysis is a cornerstone of system biology. In particular, global network similarity search (NSS) plays a vital role in establishing structural, functional, and evolutionary relationships between different biological entities. Given a query network and a network database, the NSS aims to identify novel networks that are homologous to the query network based on a similarity measure of interest. Unfortunately, the NSS is computationally inefficient due to the NP-complete graph isomorphism problem. Therefore, the design of NSS algorithm remains challenging under the dilemma between accuracy and efficiency. In this thesis, I propose and present three novel methods for global similarity search of biological networks.

The first method is a global NSS method based on regression, denotated as NSSRF, which boosts the search speed without any significant sacrifice in practical performance. As motivated from the nature, subgraph signatures are heavily involved. Two phases are proposed in NSSRF: offline model building phase and similarity query phase. In the offline model building phase, the subgraph signatures and cosine similarity scores are used for efficient random forest regression (RFR) model training. In the similarity query phase, the trained regression model is queried to return similar networks. We have extensively validated NSSRF on biological pathways and molecular structures; NSSRF demonstrates competitive performance over the state-of-the-arts. Remarkably, NSSRF works especially well for large networks, which indicates that the proposed approach can be promising in the era of big data. Case studies have proven the efficiencies and uniqueness of NSSRF which could be missed by the existing state-of-the-arts.

The second method is a global pathway similarity search approach, called ToBio, which extends NSSRF by incorporating the sequence similarity and gene ontology as biological features in three schemes. Therefore, ToBio considers both topological and biological features for effective global pathway similarity search. Specifically, various topological and biological features including subgraph signature similarities, sequence similarities, and gene ontology similarities are considered in ToBio. Since different features carry different functional importance and dependences, we report three schemes of ToBio using different sets of features. In addition, to enhance the existing search algorithms for rigorous comparisons, post-processing pipelines are also proposed to investigate how different features can contribute to the search performance. ToBio and other state-of-the-art methods are benchmarked on the gold-standard pathway datasets from three species; the results demonstrate the competitive edges of ToBio over the state-of-the-arts ranging from the topological aspects to the biological aspects. Case studies have been conducted to reveal mechanistic insights into the unique search performance of ToBio.

The third method is a novel algorithmic framework for pathway similarity search, named PathEmb, which is analogous to the Skip-gram model where each pathway is represented as a "document". PathEmb exploits a second order random walk strategy to explore diverse pathway patterns. All signaling paths traversed from random walks are regarded as "sentences", which are constituted as a "document" afterwards. Then, the "document" pattern for the individual pathway is mapped into a low-dimensional feature space for downstream tasks. Furthermore, PathEmb is a topology-free pathway similarity search algorithm, which is feasible to handle any pathway with arbitrary structure. We have extensively evaluated PathEmb and other cutting-edge methods on three pathway datasets. The experimental results demonstrate that PathEmb outperforms the existing methods in terms of computational efficiency and search accuracy.
Date of Award5 Sept 2018
Original languageEnglish
Awarding Institution
  • City University of Hong Kong
SupervisorKa Chun WONG (Supervisor)

Cite this

'