On small sample problems in active learning
Student thesis: Doctoral Thesis
In machine learning, active learning is regarded as the paradigm closest to natural learning, in which the learner asks a teacher questions and acquires knowledge from the answers. Active learning is typically combined with supervised learning. This thesis addresses the small sample problems in active learning: the small number of labeled data available may lead to poor classifiers during the active learning process. This important aspect of active learning has been largely ignored by researchers. Three schemes that combine active learning with the Support Vector Machine (SVM) are used in this thesis to address the small sample problems. The following paragraphs give an overview of the contributions of this thesis.

The first scheme aims to improve the quality of the selected data, i.e., to choose the data that convey more information to the learner. In traditional active learning, the similarity measurement between two data points relies only on the two points themselves. This may lead to low-quality selections, because such a measurement does not reflect information about the whole dataset. To select high-quality data, unlabeled data are incorporated to improve the similarity measurement between the data. To support the methods proposed under this scheme, a criterion is suggested from the viewpoint of information theory: the selected data should maximize the class entropy. From this principle, a general data selection framework is inferred, in which the selected data should be as dissimilar from each other as possible. Under this framework, three similarity measurements are proposed and three novel active learning methods are derived from them.

In the first proposed method, the weight of an edge in a graph is used directly as the similarity between the two data points connected by this edge. During the process of active learning, a single-side strategy is proposed to modify the weights of the graph. This leads to a dynamic similarity measurement that is able to reveal the similarity between data better than fixed strategies, such as the well-known Angle Diversity incorporated Active Learning with SVM (AD-ALSVM). In the second proposed method, the data are first mapped into a manifold space extracted from the graph Laplacian, and the similarities are then calculated on the mapped data. This strategy is especially useful when inherent manifold structures exist in the dataset. In addition to manifold structures, geometry information can also be extracted from the graph Laplacian. The extracted geometry information can be wrapped into a kernel function for classification with SVM, and the modified kernel matrix can be used as the similarity matrix. This leads to the third proposed method, in which the similarity adapts to the geometric structures hidden in the data selection pool.

The second scheme, which is commonly used, enlarges the labeled training set by adding pseudo-labeled data created from the set of unlabeled data. Unlike other methods, a weak label propagation method is proposed to create the pseudo-labeled data, which are combined with the labeled training set to form an enlarged training set. Instead of the commonly deployed sample-weighted SVM, a weighted margin SVM is applied to train the classifiers during the active learning process.

The third scheme for overcoming the small sample problems is to reduce the probability of forming a biased version space during the active learning process. One negative consequence of the small size of the labeled data is that the version space formed on these labeled data is prone to being biased, i.e., it does not include the target hypothesis. However, almost all active learning methods, such as the well-known Simple Distance-based Active Learning with SVM (SD-ALSVM), AD-ALSVM and query-by-committee, explicitly or implicitly assume that the current version space includes the target hypothesis. They therefore focus on exploitation, that is, finding the target hypothesis by minimizing the current hypothesis space, while neglecting exploration, that is, finding the unexploited hypothesis space in which the target hypothesis may exist. Unlike traditional deterministic active learning, this thesis proposes simulated annealing active learning, in which both exploitation and exploration are invoked in the learning process, together with a method that automatically regulates between these two strategies. In our experiments, all the proposed methods are shown to be efficient in solving the small sample problems in active learning.

Finally, an adaptive active learning approach is designed by applying one of the techniques mentioned above to the phosphorylation prediction system, in order to overcome the problems associated with sample annotation. The experiments show that this adaptive method significantly reduces the number of annotated samples and is more efficient than AD-ALSVM, which was used in our previous work, where active learning approaches were applied to the phosphorylation prediction system for the first time. It therefore provides an effective tool to assist biologists in selecting the most informative samples to annotate in a large protein database.
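The data selection framework summarized above, in which the selected data should be as dissimilar from each other as possible, can be illustrated with a minimal sketch. It is not the procedure developed in the thesis: it is a generic greedy batch selection, loosely in the spirit of angle-diversity methods, that assumes a precomputed similarity matrix and the absolute SVM decision values of the unlabeled pool are already available. The function name `select_diverse_batch` and the parameter `trade_off` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def select_diverse_batch(similarity, margin_distance, batch_size, trade_off=0.5):
    """Greedily pick pool points that are close to the decision boundary
    and dissimilar from the points already picked (illustrative sketch)."""
    similarity = np.asarray(similarity, dtype=float)
    margin_distance = np.abs(np.asarray(margin_distance, dtype=float))
    n = margin_distance.shape[0]

    available = np.ones(n, dtype=bool)   # points not yet selected
    max_sim = np.zeros(n)                # largest similarity to any selected point
    selected = []

    for _ in range(min(batch_size, n)):
        # Low score = near the boundary and unlike the current batch.
        score = trade_off * margin_distance + (1.0 - trade_off) * max_sim
        score[~available] = np.inf       # never pick the same point twice
        pick = int(np.argmin(score))
        selected.append(pick)
        available[pick] = False
        max_sim = np.maximum(max_sim, similarity[pick])

    return selected

# Toy pool: points 0 and 1 are near-duplicates, both close to the boundary.
S = np.array([[1.00, 0.95, 0.10, 0.20],
              [0.95, 1.00, 0.10, 0.20],
              [0.10, 0.10, 1.00, 0.30],
              [0.20, 0.20, 0.30, 1.00]])
margin = np.array([0.05, 0.06, 0.20, 0.90])

print(select_diverse_batch(S, margin, batch_size=2))                 # [0, 2]
print(select_diverse_batch(S, margin, batch_size=2, trade_off=1.0))  # [0, 1]
```

With `trade_off=1.0` the sketch falls back to pure distance-based selection and picks the two near-duplicate points; with the default weighting, the second pick moves to a point that is dissimilar from the first. Any of the similarity measurements described above (graph edge weights, similarities in the manifold space, or a modified kernel matrix) could in principle be passed as `similarity`, although how the thesis actually combines them is not specified here.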
- Supervised learning (Machine learning)