Classification on major histocompatibility complex II molecules
基於主要組織相容性復合體二類分子分類的研究
Student thesis: Doctoral Thesis
Detail(s)
Awarding Institution | |
---|---|
Supervisors/Advisors | |
Award date | 3 Oct 2014 |
Link(s)
Permanent Link | https://scholars.cityu.edu.hk/en/theses/theses(a8105f17-c227-4f57-98e7-3fd062a0fc4f).html |
---|---|
Abstract
In this thesis, we mainly study the classification of MHC II molecules. We also
identify protein-protein binding sites with a combined energy function. In the course of
this research in computational biology and bioinformatics, we also propose some algorithms
based on graph theory.
The major histocompatibility complex (MHC), a cell-surface protein mediating immune
recognition, plays important roles in the immune response system of all higher
vertebrates. MHC molecules are highly polymorphic and they are grouped into
serotypes according to the specificity of the response. The classification of MHC II
molecules into similar response groups is important to the development of epitope-based
vaccines. Here, we study the topic from two perspectives.
It is commonly believed that a protein's sequence determines its three-dimensional structure
and function, and hence its serotype. Residues, however, differ in importance.
In this work, we quantify residue significance using
the available serotype information. Knowing the significance of the residues deepens
our understanding of MHC molecules and yields a concise representation of the
molecules. First, we propose a linear programming-based approach to find significant
residue positions and to quantify their significance in MHC II DR molecules.
Among all the residues in MHC II DR molecules, 18 positions are of particular significance,
which is consistent with the literature on MHC binding sites, and succinct
pseudo-sequences appear adequate to capture the features of the whole sequence. When
the result is used to classify MHC molecules with serotypes assigned by the WHO,
a prediction performance of 98.4% is achieved. The methods have been implemented in
Java (http://code.google.com/p/quassi/).
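
To illustrate how a set of significant positions and their weights might be used once they are known, here is a minimal Python sketch. It does not reproduce the linear program or the Java implementation; the positions, weights, toy sequences, and the helper names `pseudo_sequence` and `weighted_mismatch` are all hypothetical.

```python
def pseudo_sequence(sequence, positions):
    """Keep only the residues at the given 1-based positions."""
    return "".join(sequence[p - 1] for p in positions)


def weighted_mismatch(seq_a, seq_b, weights):
    """Sum the weights of positions where two pseudo-sequences differ."""
    if not (len(seq_a) == len(seq_b) == len(weights)):
        raise ValueError("pseudo-sequences and weights must have equal length")
    return sum(w for a, b, w in zip(seq_a, seq_b, weights) if a != b)


# Hypothetical toy data: NOT real MHC II DR sequences, positions, or weights.
positions = [2, 5, 7]
weights = [0.5, 0.3, 0.2]
allele_x = "GDTRPRF"
allele_y = "GDTRCRF"

px = pseudo_sequence(allele_x, positions)
py = pseudo_sequence(allele_y, positions)
distance = weighted_mismatch(px, py, weights)
```

Under these assumptions, two molecules that differ only at low-weight positions end up close together, which is the intuition behind comparing short pseudo-sequences instead of full sequences.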
Most existing methods classify MHC II molecules based on binding data. An alternative
is to base the classification on sequence data; such an approach is justified
by Anfinsen's dogma, which states that a protein's sequence determines its structure,
and hence its functions. Shen et al. showed an effective kernel method based on this
approach. However, the method determines the number of clusters through inspection
and leaves some important sequences unclassified. Second, we propose a natural
solution to these two issues. We employ the Bayesian information criterion to determine
the number of clusters, utilize direction information for sequence classification, and use
phylogenetic trees for cluster creation. Experimental results show that our method
achieves 94.5% accuracy, a significant improvement over the 78% accuracy of the earlier
method.
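
As a rough illustration of choosing the number of clusters with the Bayesian information criterion, the sketch below uses the standard definition BIC = k ln(n) - 2 ln(L) and picks the candidate with the lowest value. The log-likelihoods and parameter counts are made-up numbers, and `pick_cluster_count` is a hypothetical helper; the thesis's own likelihood model is not described in this abstract.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """Standard BIC: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def pick_cluster_count(candidates, n_samples):
    """candidates maps cluster count -> (log_likelihood, n_params)."""
    return min(candidates, key=lambda k: bic(*candidates[k], n_samples))


# Hypothetical fits for 2, 3, and 4 clusters on 50 sequences.
candidates = {
    2: (-120.0, 4),
    3: (-100.0, 6),
    4: (-98.0, 8),
}
best_k = pick_cluster_count(candidates, n_samples=50)
```

In this toy example, going from three to four clusters improves the likelihood only slightly, so the penalty term outweighs the gain and three clusters are selected without manual inspection.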
Determination of binding sites between proteins has a wide range of applications,
including signal transduction studies, de novo drug design and structure identification.
A complex may contain several protein subunits and multiple binding interfaces. The
binding sites can be predicted by identifying complementary regions at protein surfaces.
Understanding the energetics and mechanisms of complexes remains one of the essential
problems in binding site prediction. We develop a system, P-Binder, for identifying
binding sites based on shape complementarity, side-chain conformations and interacting
amino acid information. P-Binder utilizes an enumeration method to generate all
possible configurations between two proteins, and uses a side-chain packing program to
identify the bound states. The system reports the binding sites with the highest ranked
configurations, evaluated through a linear combination of four statistical energy terms.
Our results show that this approach performs better than other existing methods in
binding site prediction. A comparison with some existing techniques shows that P-Binder
improves the success rate by at least 12.3%, with gains in prediction
quality in terms of both accuracy and coverage. We test P-Binder on the protein-protein
docking Benchmark v4.0. The overall accuracy and coverage are 63.8% and
68.8% for the bound state, and 51.0% and 60.9% for the unbound state.
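
The ranking step can be pictured with the following minimal Python sketch, which scores each candidate configuration by a weighted sum of four energy terms and sorts the candidates. This is not P-Binder's code: the weights, the term values, the configuration names, and the assumption that a lower combined energy is better are all illustrative.

```python
def combined_energy(terms, weights):
    """Linear combination of energy terms with the given weights."""
    if len(terms) != len(weights):
        raise ValueError("need one weight per energy term")
    return sum(w * t for w, t in zip(weights, terms))


def rank_configurations(configs, weights):
    """Return configuration names sorted by combined energy, lowest first."""
    return sorted(configs, key=lambda name: combined_energy(configs[name], weights))


# Hypothetical weights and energy terms for three candidate configurations.
weights = [1.0, 0.5, 0.5, 2.0]
configs = {
    "pose_a": [-3.0, -1.0, 0.5, -0.2],
    "pose_b": [-2.0, -2.0, -1.0, 0.1],
    "pose_c": [-4.0, 0.0, 0.0, -0.5],
}
ranking = rank_configurations(configs, weights)
```

With these made-up numbers, `pose_c` ranks first, and its interface residues would be the ones reported as the predicted binding site.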
The study of graph theory and algorithms is crucial to progress in bioinformatics.
Finding two disjoint matchings is an actively researched problem that has been studied for many years. Finally, we propose a parameterized and approximation
algorithm for this problem.
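
For readers unfamiliar with the problem, the short Python sketch below checks whether two edge sets form a pair of disjoint matchings in an undirected graph. It only validates a candidate solution, under the usual definitions (a matching has no two edges sharing a vertex; disjoint means no shared edge); it is not the algorithm proposed in the thesis, and the function names are hypothetical.

```python
def is_matching(edges):
    """True if no vertex is used by more than one edge (and no self-loops)."""
    seen = set()
    for u, v in edges:
        if u == v or u in seen or v in seen:
            return False
        seen.update((u, v))
    return True


def are_disjoint_matchings(m1, m2):
    """True if m1 and m2 are both matchings and share no undirected edge."""
    e1 = {frozenset(e) for e in m1}
    e2 = {frozenset(e) for e in m2}
    return is_matching(m1) and is_matching(m2) and e1.isdisjoint(e2)


# Path a-b-c-d: its three edges split into two disjoint matchings.
m1 = [("a", "b"), ("c", "d")]
m2 = [("b", "c")]
valid = are_disjoint_matchings(m1, m2)
```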
- Major histocompatibility complex, Bioinformatics, Protein binding, Mathematical models