Classification on major histocompatibility complex II molecules

基於主要組織相容性復合體二類分子分類的研究

Student thesis: Doctoral Thesis

Author(s)

  • Ying FAN

Detail(s)

Award date: 3 Oct 2014

Abstract

This thesis mainly studies the classification of MHC II molecules. It also identifies protein-protein binding sites using a combined energy function and, in the course of this work in computational biology and bioinformatics, proposes some algorithms based on graph theory.

The major histocompatibility complex (MHC), a cell-surface protein that mediates immune recognition, plays an important role in the immune response system of all higher vertebrates. MHC molecules are highly polymorphic and are grouped into serotypes according to the specificity of the response. Classifying MHC II molecules into groups with similar responses is important for the development of epitope-based vaccines. We study this topic from two perspectives.

It is commonly believed that a protein's sequence determines its three-dimensional structure and function, and hence its serotype. Residues differ in importance, however, and in this thesis we quantify residue significance using the available serotype information. Knowing the significance of the residues deepens our understanding of MHC molecules and yields a concise representation of them. First, we propose a linear programming-based approach that finds significant residue positions in MHC II DR molecules and quantifies their significance. Among all residues in MHC II DR molecules, 18 positions are of particular significance, which is consistent with the literature on MHC binding sites, and succinct pseudo-sequences appear adequate to capture the features of the whole sequence. When the result is used to classify MHC molecules with serotypes assigned by the WHO, a prediction performance of 98.4% is achieved. The methods have been implemented in Java (http://code.google.com/p/quassi/).

Most existing methods classify MHC II molecules based on binding data.
An alternative is to base the classification on sequence data; such an approach is justified by Anfinsen's dogma, which states that a protein's sequence determines its structure, and hence its function. Shen et al. presented an effective kernel method based on this approach. However, that method determines the number of clusters by inspection and leaves some important sequences unclassified. Second, we therefore propose a natural solution to these two issues. We employ the Bayesian information criterion to determine the number of clusters, utilize direction information for sequence classification, and use phylogenetic trees for cluster creation. Experimental results show that our method achieves 94.5% accuracy, a significant improvement over the 78% accuracy of the earlier method.

Determining binding sites between proteins has a wide range of applications, including signal transduction studies, de novo drug design and structure identification. A complex may contain several protein subunits and multiple binding interfaces. Binding sites can be predicted by identifying complementary regions on protein surfaces. Understanding the energetics and mechanisms of complexes remains one of the essential problems in binding site prediction. We develop a system, P-Binder, for identifying binding sites based on shape complementarity, side-chain conformations and interacting amino acid information. P-Binder uses an enumeration method to generate all possible configurations between two proteins, and a side-chain packing program to identify the bound states. The system reports the binding sites of the highest-ranked configurations, which are evaluated through a linear combination of four statistical energy terms. Our results show that this approach performs better than other existing methods in binding site prediction. A comparison with some existing techniques shows that P-Binder improves the success rate by at least 12.3%.
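
The ranking step described above, in which candidate configurations are scored by a linear combination of four statistical energy terms, might look roughly like the following sketch. It is only an illustration: the `Configuration` container, the weight values, the example energies and the convention that a lower combined energy ranks higher are all assumptions, since the abstract does not name the four terms or give their weights.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Configuration:
    """One candidate placement of the two proteins (hypothetical container)."""
    name: str
    # Four statistical energy terms; the abstract does not name them.
    energy_terms: Tuple[float, float, float, float]


def combined_energy(terms: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of the energy terms (the linear combination)."""
    if len(terms) != len(weights):
        raise ValueError("terms and weights must have the same length")
    return sum(w * e for w, e in zip(weights, terms))


def rank_configurations(configs, weights):
    """Sort configurations from best to worst.

    Assumption: lower combined energy means a more favourable configuration.
    """
    return sorted(configs, key=lambda c: combined_energy(c.energy_terms, weights))


# Placeholder weights and energies, chosen only to make the example concrete.
WEIGHTS = (1.0, 0.5, 0.5, 2.0)
candidates = [
    Configuration("pose_a", (-3.0, -1.0, 0.5, 0.2)),
    Configuration("pose_b", (-1.0, -4.0, -2.0, -0.5)),
    Configuration("pose_c", (0.5, 0.0, 1.0, 0.1)),
]

ranked = rank_configurations(candidates, WEIGHTS)
print([c.name for c in ranked])
```

In a real system the weights would presumably be fitted or tuned on known complexes rather than set by hand as here.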
The system also improves prediction quality in terms of both accuracy and coverage. We test P-Binder on the protein-protein docking Benchmark v4.0. The overall accuracy and coverage are 63.8% and 68.8% for the bound state, and 51.0% and 60.9% for the unbound state.

The study of graph theory and algorithms is crucial to progress in bioinformatics. Finding two disjoint matchings is a problem that has been actively studied for many years. Finally, we propose a parameterized and approximation algorithm for this problem.
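
The abstract mentions using the Bayesian information criterion (BIC) to determine the number of clusters. As a minimal sketch of that selection rule only, the snippet below computes the standard BIC, k·ln(n) − 2·ln(L), for several candidate cluster counts and keeps the smallest. The log-likelihoods, parameter counts and sample size are made-up placeholders, and the function names are hypothetical; how the thesis derives a likelihood from its kernel and phylogenetic-tree clustering is not stated in the abstract.

```python
import math


def bic(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """Standard BIC: k * ln(n) - 2 * ln(L). Lower values are preferred."""
    return num_params * math.log(num_samples) - 2.0 * log_likelihood


def select_num_clusters(fits, num_samples):
    """Pick the candidate cluster count with the lowest BIC.

    `fits` maps a candidate cluster count to a pair
    (maximised log-likelihood, number of free parameters).
    """
    scores = {k: bic(ll, p, num_samples) for k, (ll, p) in fits.items()}
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Made-up fit results for 2, 3 and 4 clusters on 200 sequences.
fits = {2: (-520.0, 5), 3: (-480.0, 8), 4: (-476.0, 11)}
best_k, scores = select_num_clusters(fits, num_samples=200)
print(best_k, scores)
```

With these placeholder numbers, going from three to four clusters raises the likelihood only slightly, so the k·ln(n) penalty outweighs the gain and three clusters are selected.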

    Research areas

  • Major histocompatibility complex, Bioinformatics, Protein binding, Mathematical models