Pattern Recognition in Biological Sequences and Matrices

生物序列和矩陣中的模式識別

Student thesis: Doctoral Thesis

Detail(s)

Award date: 22 Jun 2022

Abstract

Pattern recognition is an essential task in bioinformatics. Biological patterns often convey information about functional roles, organism identity, or phylogeny. The inherent properties of biological data, such as frequent variations and high noise, complicate the task. During my Ph.D. studies, my primary interest lay in exploring novel computational approaches for pattern recognition. I focused on two formats of biological data: sequences and matrices.

For biological sequences, we started with splice site prediction, which is a necessary step in studying the location and structure of genes. We proposed SpliceFinder, which leverages a convolutional neural network (CNN) to extract distinct patterns at splice sites. The genomic sequences are one-hot encoded before being fed into the CNN. We applied an iterative approach to reconstruct the dataset, which forces the model to learn more informative splice site patterns.
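
As a minimal sketch of the one-hot encoding step mentioned above, the following function maps each base to a four-element indicator row. The column order A, C, G, T, the all-zero row for ambiguous bases such as N, and the function name `one_hot_encode` are assumptions made for illustration; the abstract does not state how SpliceFinder handles these details.

```python
# Illustrative one-hot encoding of a DNA sequence.
# Assumptions (not taken from the thesis): column order A, C, G, T,
# and an all-zero row for any base outside that alphabet (e.g. N).
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Return one row per base, each a 4-element list of 0/1 values."""
    rows = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        index = ALPHABET.find(base)  # -1 when the base is not A, C, G, or T
        if index >= 0:
            row[index] = 1
        rows.append(row)
    return rows


# A 4-base toy input; a CNN would receive the resulting length x 4 matrix.
encoded = one_hot_encode("GTAN")
```

The output has one row per base, so its size grows with sequence length, which is why this encoding suits fixed-length windows around candidate splice sites.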

Next, we focused on phage-related patterns. We designed TemPhD, an alignment-based method that detects phage-related patterns in bacterial next-generation sequencing (NGS) data. Applying TemPhD to public datasets yielded 192,326 complete temperate phage genomes, expanding the amount of existing data more than 100-fold. This large collection of phage sequences makes it feasible to design neural network-based methods.

We then adopted neural network-based methods to identify phage patterns. However, one-hot encoding is inappropriate for sequences with variable lengths and frequent variations, such as phage sequences. To solve this issue, we designed a genome encoding method that applies various spaced k-mer pairs to tolerate sequence variations. Based on this encoding method, we proposed a phage host prediction tool named DeepHost. DeepHost achieves better prediction accuracy than other tools, and it also performs well on the sequences in the datasets that have less homology.
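
One way to picture a spaced k-mer pair is as two k-mers separated by a fixed number of skipped bases, so that a substitution inside the gap leaves the pair unchanged. The sketch below counts such pairs. The function name `spaced_kmer_pair_counts`, the parameters `k` and `gap`, and the use of raw counts are illustrative assumptions; the exact pair definitions and the combination of spacings used by DeepHost are not given in the abstract.

```python
from collections import Counter


def spaced_kmer_pair_counts(sequence, k=2, gap=1):
    """Count (left k-mer, right k-mer) pairs separated by `gap` skipped bases.

    Hypothetical illustration: the bases inside the gap are ignored, so
    variation at those positions does not change the counted pair.
    """
    seq = sequence.upper()
    span = 2 * k + gap  # total window covered by one pair
    counts = Counter()
    for start in range(len(seq) - span + 1):
        left = seq[start:start + k]
        right = seq[start + k + gap:start + span]
        counts[(left, right)] += 1
    return counts


# Two toy sequences that differ only at the gap position (G versus C).
reference = spaced_kmer_pair_counts("ACGTA")
variant = spaced_kmer_pair_counts("ACCTA")
```

Here `reference` and `variant` contain the same single pair, `("AC", "TA")`. Counting pairs over a fixed vocabulary also gives a vector whose length does not depend on the genome length, which addresses the variable-length problem noted above.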

The shortcoming of DeepHost is its failure to model interactions between k-mers. Accordingly, we proposed encoding nucleic acid sequences into a gapped pattern graph, which can then be passed through a Graph Convolutional Network (GCN) to form a lower-dimensional embedding for downstream tasks. We examined four phage-related tasks in this work: phage and ICE discrimination, phage integration site prediction, phage lifestyle prediction, and phage host prediction. The framework built on the proposed encoding scheme, called GraphPhage, outperforms state-of-the-art methods under various metrics on all four tasks.
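
To illustrate how a graph convolution can turn node features into a lower-dimensional embedding, the sketch below implements a single layer using the commonly used propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W). This is a generic GCN layer written in plain Python; whether GraphPhage uses this exact rule, and how its gapped pattern graph defines nodes and edge weights, is not stated in the abstract, so the function name `gcn_layer` and the toy inputs are assumptions.

```python
import math


def gcn_layer(adjacency, features, weights):
    """One generic graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    adjacency: n x n list of non-negative edge weights (A)
    features:  n x f list of node features (H)
    weights:   f x o list of layer weights (W), with o < f to reduce dimension
    """
    n = len(adjacency)
    # Add self-loops so each node keeps part of its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    # Symmetrically normalize the adjacency matrix.
    norm = [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]
    n_feat = len(features[0])
    # Aggregate neighbor features for every node.
    agg = [[sum(norm[i][k] * features[k][f] for k in range(n))
            for f in range(n_feat)] for i in range(n)]
    # Apply the linear map followed by ReLU.
    return [[max(0.0, sum(agg[i][f] * weights[f][o] for f in range(n_feat)))
             for o in range(len(weights[0]))] for i in range(n)]


# Toy graph: two connected nodes with 2 features each, projected to 1 dimension.
embedding = gcn_layer([[0, 1], [1, 0]],
                      [[1.0, 0.0], [3.0, 2.0]],
                      [[1.0], [1.0]])
```

Because each node's output mixes its own features with those of its neighbors, the embedding can reflect interactions between connected patterns, which a bag of independent k-mer counts cannot.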

Single-cell data are often treated as matrices for analysis, with rows as cells and columns as features (genes, locations, etc.). Single-cell DNA sequencing technologies allow us to find local copy number variation (CNV) patterns along the genome. We proposed SeCNV, an efficient method that leverages structural entropy to divide the genome into segments with similar CNV patterns and to profile their copy numbers. From the cell-bin read coverage matrix, SeCNV adopts a local Gaussian kernel to construct a depth congruent map, which captures the similarity between any two bins along the genome. SeCNV then partitions the genome into segments by minimizing the structural entropy of the depth congruent map. SeCNV processed large datasets (> 50,000 cells) within four minutes.
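
The idea of a bin-to-bin similarity map can be sketched with an ordinary Gaussian kernel over the columns of the cell-bin coverage matrix. The version below uses a single global bandwidth `sigma` for simplicity; the "local" aspect of SeCNV's kernel and the structural entropy step are not reproduced, and the function name `gaussian_similarity_map` is an illustrative assumption.

```python
import math


def gaussian_similarity_map(coverage, sigma=1.0):
    """Return a bins x bins similarity matrix from a cells x bins matrix.

    coverage[c][b] is the read depth of bin b in cell c. Similarity between
    bins i and j is exp(-||x_i - x_j||^2 / (2 * sigma^2)), where x_i is the
    column of depths for bin i across all cells. A global sigma is an
    assumption made here for simplicity.
    """
    n_bins = len(coverage[0])
    columns = [[row[b] for row in coverage] for b in range(n_bins)]
    similarity = [[0.0] * n_bins for _ in range(n_bins)]
    for i in range(n_bins):
        for j in range(n_bins):
            squared_distance = sum((x - y) ** 2
                                   for x, y in zip(columns[i], columns[j]))
            similarity[i][j] = math.exp(-squared_distance / (2 * sigma ** 2))
    return similarity


# Two cells, three bins: bins 0 and 1 share a depth profile, bin 2 is amplified.
depth_map = gaussian_similarity_map([[2, 2, 8], [3, 3, 9]])
```

In this toy input, bins 0 and 1 receive the maximum similarity of 1.0 while bin 2 is close to 0 against both, which is the kind of block structure a segmentation step could then exploit.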

Recently developed single-cell multi-omics technologies enable simultaneous measurement of multiple types of molecules at single-cell resolution, allowing us to study the integrated latent patterns across omics. The multiple matrices constitute a tensor, i.e., a higher-order generalization of a matrix. Accordingly, we proposed SCOT, a probabilistic tensor decomposition framework for integrating single-cell multi-omics data. Applying SCOT to seven single-cell multi-omics datasets, we showed that it learns informative latent patterns for cells and genes, enabling various downstream analyses, including cell clustering, cross-omics gene expression analysis, gene regulatory network study, and multi-omics imputation.
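
To make the tensor view concrete, the sketch below rebuilds a cells x genes x omics tensor from three small factor matrices using a CP-style (sum of rank-one terms) model. The choice of a CP model, the deterministic reconstruction, and the names `cp_reconstruct`, `cell_factors`, `gene_factors`, and `omics_factors` are assumptions for illustration only; the abstract says only that SCOT is a probabilistic tensor decomposition and does not specify its form.

```python
def cp_reconstruct(cell_factors, gene_factors, omics_factors):
    """Rebuild tensor[cell][gene][omics] = sum_r C[cell][r] * G[gene][r] * O[omics][r].

    Each factor matrix has one row per entity and one column per latent
    component. This is a generic CP-style model, not SCOT's actual one.
    """
    rank = len(cell_factors[0])
    return [[[sum(c[r] * g[r] * o[r] for r in range(rank))
              for o in omics_factors]
             for g in gene_factors]
            for c in cell_factors]


# Toy factors with 2 latent components: 2 cells, 3 genes, 2 omics layers.
cells = [[1, 0], [0, 1]]
genes = [[2, 1], [0, 3], [1, 1]]
omics = [[1, 1], [2, 0]]
tensor = cp_reconstruct(cells, genes, omics)
```

In a model of this kind, the rows of the cell and gene factor matrices play the role of latent patterns: cell rows could be clustered, and entries missing from one omics layer could be estimated from the reconstructed tensor.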

    Research areas

  • Bioinformatics, Machine learning, Algorithm