Long Reads Mapping and Aligning, Protein-protein Interactions Predicting and Mutation Analysis of Retinitis Pigmentosa with NGS Data

長基因序列定位與比對,蛋白質相互作用預測及視網膜色素變性的基因突變分析

Student thesis: Doctoral Thesis

Award date: 3 Jan 2023

Abstract

In this thesis, we study three problems in bioinformatics and computational biology. The first is mapping and aligning long reads to a reference genome, the second is predicting protein-protein interactions with a novel network-based algorithm, and the third is mutation analysis of retinitis pigmentosa using next-generation sequencing (NGS) data.

Long reads play an important role in identifying structural variants, sequencing repetitive regions, phasing alleles, and related tasks. In this thesis, we propose a new approach for mapping long reads to reference genomes, together with a new method for generating accurate alignments between the long reads and the corresponding segments of the reference genome. The mapping algorithm is based on the longest common subsequence with distance constraints. The (local) alignment algorithm is based on recursive alignment of variable-size k-mers. We implemented all the algorithms in C++ and produced a software package named mapAlign. Experiments show that our method generates better alignments, in terms of both identity and alignment score, on both Nanopore and SMRT data sets. In particular, on data from human individuals, our method aligns 91.53% (Nanopore) and 85.36% (SMRT) of the letters on reads to identical letters on the reference genome, whereas the state-of-the-art method aligns only 88.44% and 79.08%, respectively. Our method is also faster than the state-of-the-art method.
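
To make the idea of a longest common subsequence with distance constraints more concrete, the following is a minimal Python sketch, not the mapAlign implementation (which is written in C++ and whose details are not given in this abstract). It assumes that the matched elements are exact k-mer hits between read and reference, and that the distance constraint bounds how much the gap between consecutive hits may differ on the read and on the reference. The function names `kmer_anchors` and `longest_constrained_chain` and the parameter `max_gap_diff` are hypothetical.

```python
from typing import Dict, List, Tuple

Anchor = Tuple[int, int]  # (position on read, position on reference)


def kmer_anchors(read: str, ref: str, k: int) -> List[Anchor]:
    """Collect every exact k-mer match between the read and the reference."""
    index: Dict[str, List[int]] = {}
    for j in range(len(ref) - k + 1):
        index.setdefault(ref[j:j + k], []).append(j)
    anchors: List[Anchor] = []
    for i in range(len(read) - k + 1):
        for j in index.get(read[i:i + k], []):
            anchors.append((i, j))
    return anchors


def longest_constrained_chain(anchors: List[Anchor], max_gap_diff: int) -> List[Anchor]:
    """Longest chain of anchors that increases on both sequences.

    Assumed constraint: for consecutive anchors in the chain, the gap on the
    reference and the gap on the read differ by at most max_gap_diff.
    Simple O(n^2) dynamic programming, intended only as an illustration.
    """
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return []
    best = [1] * n    # best[b]: length of the longest valid chain ending at b
    prev = [-1] * n   # back-pointers used to recover the chain
    for b in range(n):
        qb, rb = anchors[b]
        for a in range(b):
            qa, ra = anchors[a]
            ordered = qa < qb and ra < rb
            if ordered and abs((rb - ra) - (qb - qa)) <= max_gap_diff:
                if best[a] + 1 > best[b]:
                    best[b] = best[a] + 1
                    prev[b] = a
    end = max(range(n), key=best.__getitem__)
    chain: List[Anchor] = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]
```

In this sketch, an anchor that lands far away on the reference (for example, a hit inside a repeat) is excluded from the chain because it violates the gap constraint, which is one plausible reason to add distance constraints to a plain longest common subsequence.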

Protein-protein interactions (PPIs) play an essential role in most biological processes in cells. Many computational algorithms have thus been proposed to predict PPIs. However, most of them rely heavily on the biological information of proteins while ignoring the latent structural features that proteins exhibit in a PPI network. In this thesis, we propose an efficient network-based prediction algorithm, namely PPISB, based on a mixed membership stochastic blockmodel. By simulating the generative process of a PPI network, PPISB is able to capture the latent community structures. The inference procedure adopted by PPISB further optimizes the membership distributions of proteins over different complexes. After that, a distance measure is designed to compute the similarity between two proteins in terms of their likelihood of being in the same complex, which is then used to decide whether they interact with each other. We conducted extensive experiments to evaluate the performance of PPISB on five PPI networks collected from different species. The results demonstrate that PPISB performs promisingly at predicting PPIs under several evaluation metrics. Hence, we reason that PPISB is preferable to state-of-the-art network-based prediction algorithms, especially for predicting potential PPIs.
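
As a rough illustration of how membership distributions over complexes could be turned into an interaction score, the Python sketch below scores a protein pair by the probability that both proteins fall in the same complex when each is assigned independently according to its membership distribution. This inner-product score is an assumption chosen for illustration; the abstract does not specify the distance measure that PPISB actually uses. The names `same_complex_score` and `rank_candidate_pairs` are hypothetical, and the membership vectors are taken as given rather than inferred.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple


def same_complex_score(pi_u: Sequence[float], pi_v: Sequence[float]) -> float:
    """Probability that two proteins share a complex (assumed measure).

    pi_u and pi_v are membership distributions over the same K complexes.
    """
    if len(pi_u) != len(pi_v):
        raise ValueError("membership vectors must have the same length")
    return sum(a * b for a, b in zip(pi_u, pi_v))


def rank_candidate_pairs(
    memberships: Dict[str, Sequence[float]],
    known_edges: Set[Tuple[str, str]],
) -> List[Tuple[Tuple[str, str], float]]:
    """Score every protein pair not already in the network, best first."""
    known = {frozenset(edge) for edge in known_edges}
    scored = []
    for u, v in combinations(sorted(memberships), 2):
        if frozenset((u, v)) in known:
            continue  # only unobserved pairs are candidates for prediction
        scored.append(((u, v), same_complex_score(memberships[u], memberships[v])))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```

Pairs near the top of the ranking would be the candidates for potential interactions; where to set the cutoff is a separate choice that this sketch leaves open.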

Retinitis pigmentosa (RP) is a typical inherited eye disease (IED). Inherited eye diseases cause many kinds of vision problems in humans, e.g., night blindness, severe vision loss, and declining visual acuity. Genes play an important role in RP, and patients with RP usually carry one or more gene malformations. However, the strong genetic heterogeneity of RP often makes clinical diagnosis difficult. Panel-based next-generation sequencing retrieves specified portions of the genome from subjects, making it an economical, precise, and time-efficient way to study the genetic mutations of a particular disease. In this thesis, we applied panel-based NGS to a group of RP-related individuals and designed a pipeline to study the pathogenic mutations of the subjects. We identified 45 variants. According to the guidelines of the American College of Medical Genetics and Genomics (ACMG), 22 of them are pathogenic, 14 are likely pathogenic, and 9 are of uncertain clinical significance. Missense is the most frequent functional change type, accounting for 53% (24/45) of these variants. To our knowledge, 22 of the 45 variants are reported for the first time. Our work presents a practical example of applying panel-based NGS to RP and extends the existing genotype spectrum of RP, which can benefit future genetic counseling or therapy for northeast Chinese RP patients.