In machine learning, active learning is the paradigm regarded as closest to natural learning, in which the learner asks the teacher questions and acquires knowledge from the answers. Active learning is typically combined with supervised learning. This thesis addresses the small sample problems in active learning: the small number of labeled data may yield poor classifiers during the active learning process. This important aspect of active learning has been largely ignored by researchers. Three schemes that combine active learning with the Support Vector Machine (SVM) are developed in this thesis to solve the small sample problems. The following paragraphs give an overview of the contributions of this thesis.
The first scheme aims to improve the quality of the selected data, i.e., to choose the data that convey more information to the learner. In traditional active learning, the similarity measurement between two data points depends only on the points themselves. This may lead to low-quality selections, since such measurements do not reflect information about the whole dataset. To select data of high quality, the motivation is to incorporate unlabeled data into the similarity measurement. To support the methods proposed under this scheme, a criterion is suggested from the viewpoint of information theory: the selected data should maximize the class entropy. From this principle, a general data selection framework is derived in which the selected data should be as dissimilar from each other as possible. Under this framework, three similarity measurements are defined and three novel active learning methods are derived from them respectively.
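The dissimilarity-based selection framework can be sketched as a greedy procedure that repeatedly picks the pool point least similar to those already chosen. This is only an illustrative sketch: the cosine measurement and the seeding choice below are stand-in assumptions, not the similarity measurements developed in the thesis.

```python
import numpy as np

def select_diverse(pool, k, similarity):
    """Greedily pick k points from `pool` so that each new point is as
    dissimilar as possible from the points already selected (a sketch of
    the dissimilarity-based selection framework)."""
    selected = [0]  # seed with the first pool point (a simplification)
    while len(selected) < k:
        best, best_score = None, np.inf
        for i in range(len(pool)):
            if i in selected:
                continue
            # a candidate's score is its maximum similarity to the chosen
            # set; lower means more dissimilar, hence more informative
            score = max(similarity(pool[i], pool[j]) for j in selected)
            if score < best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

def cosine(a, b):
    # cosine similarity as a simple stand-in measurement
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

pool = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(select_diverse(pool, 2, cosine))  # → [0, 2], two mutually dissimilar points
```

With this toy pool, the second pick is the point orthogonal to the seed, which is exactly the "as dissimilar as possible" behavior the framework calls for.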
In the first proposed method, the weight of an edge in a graph is directly employed as the similarity between the two data points it connects. During the active learning process, a single-side strategy is proposed to modify the weights of the graph. This yields a dynamic similarity measurement that reveals the similarity between data better than fixed strategies, such as the well-known Angle Diversity incorporated Active Learning with SVM (AD-ALSVM). In the second proposed method, the data are first mapped into a manifold space extracted from the graph Laplacian, and the similarities are then calculated on the mapped data. This strategy is useful especially when inherent manifold structures exist in the dataset. Beyond manifold structures, geometric information can also be extracted from the graph Laplacian and wrapped into a kernel function for classification with SVM. The modified kernel matrix can then serve as the similarity matrix, which leads to the third proposed method, in which the similarity adapts to the geometric structures hidden in the data selection pool.
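The mapping behind the second method can be sketched as a spectral embedding: build a graph over the data, form its Laplacian, and use the smoothest non-constant eigenvectors as new coordinates in which similarities are measured. The Gaussian affinities and the unnormalized Laplacian below are illustrative assumptions, not the thesis's exact construction.

```python
import numpy as np

def laplacian_embedding(X, dim, sigma=1.0):
    """Map data into a spectral ("manifold") space extracted from the
    graph Laplacian; similarities can then be computed on the mapped data."""
    # pairwise Gaussian affinities define the graph weights
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(1))
    L = D - W                        # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    # skip the constant eigenvector; keep the next `dim` smoothest ones
    return vecs[:, 1:dim + 1]

# two tight clusters: in the embedded space, points from the same cluster
# land at (nearly) the same coordinate, so similarity reflects the structure
X = np.vstack([np.zeros((5, 2)), np.ones((5, 2)) * 2])
Y = laplacian_embedding(X, 1)
```

The second eigenvector of the Laplacian (the Fiedler vector) separates the two clusters, so distances in the embedded space respect the graph structure rather than only the raw coordinates.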
The second scheme enlarges the labeled training set by adding pseudo-labeled data created from the set of unlabeled data. Unlike other methods, a weak label propagation method is proposed to create the pseudo-labeled data, which are merged with the labeled data to form an enlarged training set. Instead of the commonly deployed sample-weighted SVM, a weighted margin SVM is applied to train the classifiers during the active learning process.
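The pseudo-labeling idea can be illustrated with a toy graph-based label propagation, in which unlabeled points iteratively absorb the labels of nearby points through Gaussian graph weights. This is a generic sketch only; the thesis's weak label propagation and the weighted margin SVM are not reproduced here.

```python
import numpy as np

def propagate_labels(X_lab, y_lab, X_unl, sigma=1.0, steps=10):
    """Create pseudo-labels for X_unl by repeatedly diffusing the known
    labels over a Gaussian-weighted graph, clamping the true labels."""
    X = np.vstack([X_lab, X_unl])
    n_lab = len(X_lab)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(1, keepdims=True)           # row-stochastic transition matrix
    classes = np.unique(y_lab)
    F = np.zeros((len(X), len(classes)))      # one-hot label scores
    F[np.arange(n_lab), np.searchsorted(classes, y_lab)] = 1.0
    for _ in range(steps):
        F = P @ F                              # diffuse labels over the graph
        F[:n_lab] = 0.0                        # clamp the true labels
        F[np.arange(n_lab), np.searchsorted(classes, y_lab)] = 1.0
    return classes[F[n_lab:].argmax(1)]        # pseudo-labels for X_unl

X_lab = np.array([[0.0, 0.0], [4.0, 4.0]])
y_lab = np.array([0, 1])
X_unl = np.array([[0.5, 0.2], [3.8, 3.9]])
print(propagate_labels(X_lab, y_lab, X_unl))   # each point takes its neighbor's label
```

The pseudo-labeled points would then be appended to the labeled set before retraining; in the thesis, the subsequent training step uses a weighted margin SVM rather than a standard one.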
The third scheme for overcoming the small sample problems reduces the probability of forming a biased version space during the active learning process. One negative consequence of a small labeled set is that the version space formed on it is prone to being biased, i.e., it may not include the target hypothesis. However, almost all active learning methods, such as the well-known Simple Distance-based Active Learning with SVM (SD-ALSVM), AD-ALSVM, and query-by-committee, explicitly or implicitly assume that the current version space includes the target hypothesis. They therefore focus on exploitation, finding the target hypothesis by shrinking the current hypothesis space, while neglecting exploration, searching the unexploited hypothesis space in which the target hypothesis may actually lie. Unlike traditional deterministic active learning, this thesis proposes simulated annealing active learning, in which both exploitation and exploration are invoked during the learning process, together with a method that automatically regulates the balance between the two strategies. In our experiments, all the proposed methods are shown to be efficient in solving the small sample problems in active learning.
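The exploration/exploitation trade-off can be sketched with a simulated-annealing style query rule: a random exploratory candidate is accepted with a Boltzmann probability, and as the temperature is annealed toward zero the rule converges to always querying the most uncertain point. The Boltzmann acceptance form is an assumption for illustration, not the thesis's exact regulation method.

```python
import math
import random

def sa_query(scores, temperature, rng):
    """Pick a query index from uncertainty `scores` (higher = closer to
    the decision boundary): a random exploratory candidate is accepted
    with a Boltzmann probability, otherwise the greedy (exploitative)
    most-uncertain candidate is queried."""
    greedy = max(range(len(scores)), key=lambda i: scores[i])
    cand = rng.randrange(len(scores))          # exploratory candidate
    # high temperature -> exploration likely; low -> pure exploitation
    accept = math.exp((scores[cand] - scores[greedy]) / max(temperature, 1e-12))
    return cand if rng.random() < accept else greedy

rng = random.Random(0)
# as the temperature is annealed, queries shift from exploring the pool
# toward exploiting the most uncertain point (index 1 here)
picks = [sa_query([0.1, 0.9, 0.3], t, rng) for t in (10.0, 1.0, 1e-6)]
```

Because the acceptance probability never exceeds one and vanishes at low temperature, the rule degrades gracefully into a deterministic uncertainty-based strategy such as SD-ALSVM-style selection.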
Finally, an adaptive active learning approach is designed by applying one of the techniques above to a phosphorylation prediction system, to overcome the problems associated with sample annotation. The experiments show that this adaptive method significantly reduces the number of annotated samples and is more efficient than the AD-ALSVM used in our previous work, where active learning approaches were applied to phosphorylation prediction for the first time. It thus provides an effective tool to help biologists select the most informative samples to annotate in a large protein database.
| Date of Award | 15 Jul 2010 |
|---|---|
| Original language | English |
| Awarding Institution | City University of Hong Kong |
| Supervisor | Ho Shing Horace IP (Supervisor) |
- Supervised learning (Machine learning)
On small sample problems in active learning
JIANG, J. (Author). 15 Jul 2010
Student thesis: Doctoral Thesis