Efficient Algorithms for Identification of Modified Proteoforms Using Top-down Mass Spectra

Project: Research



Mass spectrometry-based top-down proteomics has become one of the most informative approaches in protein analysis because it provides a bird's-eye view of all intact proteoforms generated by post-translational modifications and sequence variations. Recent developments in mass spectrometry and intact protein separation technologies have expanded the scope of top-down proteomics from single-protein analysis to proteome-wide analysis. These new approaches have demonstrated unique advantages in understanding proteoform functions, discovering molecular signatures of disease, and identifying possible drug targets. A major challenge in proteoform identification by database search is the combinatorial explosion of possible proteoforms resulting from combinations of sequence variations, post-translational modifications, and other molecular events, such as protein degradation. Existing top-down proteomics software tools (e.g., big Mascot, ProSightPC, MS-Align+) have to exclude most potential proteoforms from the search database to keep its size manageable, which significantly limits their ability to identify complex proteoforms. The notion of a proteoform mass graph has been proposed to represent a very large set of potential proteoforms. In this project, we will design and implement fast algorithms that precisely identify complex proteoforms at the proteome level using proteoform mass graphs and top-down tandem mass spectra. We will consider the case where the target protein is known in advance as well as the case where it is not. In the second case, the database contains a large number of proteoform mass graphs, one for each protein sequence. We will design algorithms that quickly filter the proteoform mass graphs in the database to identify a small number that are similar to the query spectral mass graph.
Bottom-up MS-based proteogenomics has demonstrated that customized protein databases derived from RNA-Seq data significantly increase proteome coverage and improve the ability to identify sample-specific peptides. We plan to extract sample-specific sequence variations from RNA-Seq data and incorporate them into sample-specific mass graphs. These customized mass graphs will facilitate the identification of sample-specific proteoforms, especially cancer-specific proteoforms. We will design methods to construct sample-specific mass graphs that represent proteoforms based on RNA-Seq data.
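
To make the idea of a proteoform mass graph more concrete, the following is a minimal sketch in Python under a simplified model, not the project's actual algorithm. It assumes that node i sits between residues i and i+1, that each residue contributes one edge per allowed mass (unmodified, or shifted by a variable modification), and that every path from the first to the last node is one candidate proteoform. The function names, the small residue table, and the choice of modifications are all hypothetical and chosen only for illustration; masses are stored as integers (units of 0.00001 Da) so that path sums are exact.

```python
# Illustrative monoisotopic residue masses in Da (small subset only).
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "K": 128.09496,
}

# Hypothetical variable modifications: residue -> list of mass shifts in Da.
VARIABLE_MODS = {
    "S": [79.96633],  # phosphorylation
    "K": [42.01057],  # acetylation
}

SCALE = 100_000  # store masses as integers in units of 0.00001 Da


def build_mass_graph(sequence):
    """Return a list where entry i holds the integer masses of all
    edges from node i to node i + 1 (one edge per allowed residue mass)."""
    graph = []
    for residue in sequence:
        base = RESIDUE_MASS[residue]
        shifts = [0.0] + VARIABLE_MODS.get(residue, [])
        graph.append([round((base + shift) * SCALE) for shift in shifts])
    return graph


def count_proteoforms(graph):
    """Number of distinct paths through the graph, i.e. the number of
    proteoforms it encodes under this simplified model."""
    count = 1
    for edge_masses in graph:
        count *= len(edge_masses)
    return count


def reachable_prefix_masses(graph):
    """For each node, the set of prefix masses reachable by some path.
    This dynamic program avoids enumerating every path explicitly."""
    reachable = [{0}]
    for edge_masses in graph:
        previous = reachable[-1]
        reachable.append({p + m for p in previous for m in edge_masses})
    return reachable


if __name__ == "__main__":
    graph = build_mass_graph("GASK")
    print("encoded proteoforms:", count_proteoforms(graph))
    for node, masses in enumerate(reachable_prefix_masses(graph)):
        print(node, sorted(m / SCALE for m in masses))
```

The sketch shows why the graph is a compact representation: the number of encoded proteoforms grows multiplicatively with each modifiable residue, while the graph itself grows only linearly with sequence length. Sample-specific sequence variations could be modelled in the same way, as additional edges between the same pair of nodes.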


Project number: 9042817
Grant type: GRF
Effective start/end date: 1/01/20 → …