Deep Learning in Post-Translational Modification Site Prediction

Student thesis: Doctoral Thesis

Abstract

Post-translational modifications (PTMs) generally refer to the addition of functional groups (e.g., phosphates, acetates, small proteins, lipids, carbohydrates) to amino acid residues during or after protein biosynthesis. To date, over 650 different types of PTMs have been discovered across proteins. PTMs are critical for maintaining protein structure and function, metabolic regulation, cellular signaling, and proteomic diversity. Over the past decade, mass spectrometry (MS)-based proteomic techniques have played a major role in PTM identification, yielding experimentally validated data. Building on such data, computational methods can further explore and predict new modification sites by learning models from the identified ones. In recent years, machine learning has become a cost-effective and labor-efficient approach for predicting various types of PTM sites.

Specifically, deep learning is an advanced branch of machine learning that can automatically learn PTM patterns and capture high-level abstractions. Consequently, researchers have recently been shifting their attention from traditional machine learning to deep learning for PTM site prediction. Despite these advantages, such studies still suffer from limitations: models are often developed for only a limited number of PTM types, are trained only on short protein fragments with limited contextual information, and use architectures that lag behind more recent advances.

This thesis aims to advance the development of effective and robust transformer-based models for predicting PTM sites. It introduces two main innovations: (1) incorporating underexplored PTM types such as non-histone acetylation and N6-carboxylysine, which are seldom addressed by other computational methods; and (2) using full-length protein sequences that host multiple PTM sites of the same type for both training and prediction. This approach contrasts with other studies, which typically use short protein fragments extracted with a sliding-window technique, each containing only a single PTM site (see the sketch below).
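To illustrate the difference between the two data-preparation strategies, the minimal sketch below contrasts conventional sliding-window fragment extraction with the full-length formulation adopted in this thesis. The window size of 31, the choice of lysine ('K') sites, and the toy sequence are illustrative assumptions, not values taken from the thesis.

```python
# Minimal sketch (not thesis code): sliding-window extraction vs. full-length input.
# Assumptions: a hypothetical 31-residue window and 'X' padding at the termini.

def sliding_window_fragments(sequence: str, site_positions: list[int], window: int = 31) -> list[str]:
    """Extract one short fragment per PTM site, as most previous predictors do."""
    half = window // 2
    padded = "X" * half + sequence + "X" * half
    # Each fragment is centered on a single PTM site; the rest of the protein is discarded.
    return [padded[pos:pos + window] for pos in site_positions]

def full_length_sample(sequence: str, site_positions: list[int]) -> tuple[str, list[int]]:
    """Keep the whole protein plus a per-residue label vector (the full-length formulation)."""
    sites = set(site_positions)
    labels = [1 if i in sites else 0 for i in range(len(sequence))]
    return sequence, labels

if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy sequence
    sites = [7, 15]                              # hypothetical modified lysines (0-based)
    print(sliding_window_fragments(seq, sites))
    print(full_length_sample(seq, sites))
```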

In the first part, we first summarized and discussed the most recent progress in the prediction of PTMs using deep learning-based methods, with a particular emphasis on protein phosphorylation, methylation, acetylation, and ubiquitination sites. We also presented frequently used databases for deep learning-based PTM prediction, along with future directions in the computational identification of PTMs.

In the second part, our research contributes both to dataset creation and to predictor benchmarking in the area of non-histone acetylation site prediction. We developed a benchmark dataset for non-histone acetylation site prediction, named NHAC, which comprises 11 subsets categorized by sequence length, ranging from 11 to 61 amino acids; each subset contains 886 positive and 4707 negative samples. Based on this dataset, we propose TransPTM, a transformer-based computational model for non-histone acetylation site identification. The model uses the pre-trained protein language model ProtT5 to construct the feature space for each site. The embedded protein sequence data are then processed by a graph neural network (GNN) comprising three TransformerConv layers for feature extraction and a multilayer perceptron (MLP) for classification. TransPTM demonstrates superior performance in non-histone acetylation site prediction. This success not only enhances our understanding of the molecular mechanisms underlying non-histone acetylation but also provides a theoretical foundation for the development of new drug targets.
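A minimal sketch of the described pipeline is given below. It assumes ProtT5 per-residue embeddings of dimension 1024 and uses PyTorch Geometric's TransformerConv operator; the hidden sizes, head count, and chain-graph construction are illustrative guesses rather than the thesis's actual hyperparameters or graph definition.

```python
# Illustrative sketch (not the published TransPTM implementation):
# ProtT5 residue embeddings -> 3 TransformerConv layers (GNN) -> MLP classifier.
import torch
import torch.nn as nn
from torch_geometric.nn import TransformerConv

class TransPTMSketch(nn.Module):
    def __init__(self, in_dim: int = 1024, hidden: int = 128, heads: int = 4):
        super().__init__()
        # Three graph transformer layers for feature extraction over residue nodes.
        self.conv1 = TransformerConv(in_dim, hidden, heads=heads, concat=False)
        self.conv2 = TransformerConv(hidden, hidden, heads=heads, concat=False)
        self.conv3 = TransformerConv(hidden, hidden, heads=heads, concat=False)
        # MLP head producing one acetylation score per node (residue).
        self.mlp = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv1(x, edge_index))
        x = torch.relu(self.conv2(x, edge_index))
        x = torch.relu(self.conv3(x, edge_index))
        return self.mlp(x).squeeze(-1)   # logits; apply sigmoid for probabilities

# Toy usage: 21 residue nodes connected as a simple chain graph.
x = torch.randn(21, 1024)                          # ProtT5-style embeddings
src = torch.arange(20)
edge_index = torch.stack([src, src + 1])           # shape [2, num_edges]
logits = TransPTMSketch()(x, edge_index)
```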

In the third part of the research, we developed a significantly larger PTM dataset and a more robust PTM site prediction model than in our second project. To broaden the application of PTM prediction models beyond non-histone acetylation to a wider range of PTM types, we created a benchmark dataset called PTMseq. This dataset encompasses 12,203 full-length protein sequences, ranging from 12 to 1000 amino acids in length, and includes 34,514 PTM sites across 9 different PTM types. Building on this dataset, we propose UniPTM, a transformer-based PTM site prediction model. The model first employs three pre-trained protein language models (ProtBert, ProtT5, and ESM-2) to embed the full-length sequences in the dataset. The embedded protein sequences are then fed into a transformer model with three key components: a CNN layer that reduces the dimensionality of long protein sequences, an 8-head transformer for feature extraction, and three fully connected layers for classification. UniPTM demonstrates outstanding performance across all 9 PTM types. Previous computational PTM site prediction models often relied on a sliding-window technique to extract short fragments around PTM sites, because PTM sites are sparse along full-length proteins; this, however, discards valuable contextual information from the rest of the protein. UniPTM is the first model to use full-length protein sequences for PTM prediction, ensuring that predictions are grounded in complete natural protein contexts.
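A minimal sketch of the described architecture is shown below. The abstract does not specify how the three language-model embeddings are combined, so concatenation is assumed here; the embedding dimensions (1024 for ProtBert/ProtT5, 1280 for a mid-sized ESM-2 checkpoint), kernel size, and hidden widths are likewise illustrative assumptions rather than the thesis's reported settings.

```python
# Illustrative sketch (not the published UniPTM implementation):
# concatenated PLM embeddings -> 1D CNN for dimensionality reduction ->
# 8-head transformer encoder -> 3 fully connected layers for per-residue scores.
import torch
import torch.nn as nn

class UniPTMSketch(nn.Module):
    def __init__(self, in_dim: int = 1024 + 1024 + 1280, d_model: int = 256):
        super().__init__()
        # CNN layer reduces the per-residue embedding dimensionality.
        self.cnn = nn.Conv1d(in_dim, d_model, kernel_size=3, padding=1)
        # 8-head transformer encoder for feature extraction over the full sequence.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        # Three fully connected layers for per-residue classification.
        self.fc = nn.Sequential(
            nn.Linear(d_model, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_dim) concatenated ProtBert/ProtT5/ESM-2 embeddings
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)   # (batch, seq_len, d_model)
        x = self.encoder(x)
        return self.fc(x).squeeze(-1)                     # per-residue logits

# Toy usage: one protein of 500 residues.
emb = torch.randn(1, 500, 1024 + 1024 + 1280)
scores = torch.sigmoid(UniPTMSketch()(emb))
```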
Date of Award: 17 Sept 2024
Original language: English
Awarding Institution
  • City University of Hong Kong
Supervisors: Ka Chun WONG (Supervisor) & Hongyan SUN (Supervisor)

Keywords

  • Post-translational modification
  • Deep learning
  • Large language models
  • Dataset Construction
  • Transformer
