Techniques for improving classification abilities in supervised learning and active learning

Student thesis: Doctoral Thesis

Author(s)

  • Ran WANG

Award date: 14 Feb 2014

Abstract

Supervised learning is an important branch of machine learning that infers a mapping from a set of labeled training samples. The mapping may be a function or a set of rules, which is used to predict the labels of new samples. The most widely used supervised learning techniques for classification include support vector machines (SVMs), decision trees (DTs), and extreme learning machines (ELMs). Although these techniques perform well in certain domains, each suffers from problems that may affect its generalization capability or learning efficiency. First, traditional SVMs, which are grounded in statistical learning theory, are designed for binary classification, and extending them to multiclass problems remains a hot topic. Second, the originally proposed DTs can handle only discrete-valued attributes, so continuous attributes must be discretized; since almost all existing discretization approaches are frequency-based heuristics, they neglect the sample distribution and may harm classification ability. Third, ELMs are emerging techniques for training single-hidden-layer feedforward neural networks (SLFNs); although they exhibit extremely fast learning speed, the randomly assigned input weights lead to instability.

Active learning, on the other hand, is a revised supervised learning scheme that adopts selective sampling. In many real-world machine learning applications, labeled samples are inadequate for inferring a good model; unlabeled samples are abundant, but manual labeling is expensive. Active learning is an iterative process that allows the learner to select informative samples from the numerous unlabeled ones, giving it at least some control over the input domain through a selection criterion. Moreover, selecting informative samples reduces data complexity and redundancy, thus improving learning efficiency. Developing active learning models is therefore a valuable way to further improve the performance of supervised learning techniques. Currently, the most important issue in active learning is designing an effective sample selection criterion for measuring the informativeness of unlabeled samples.

The contribution of this thesis consists of two main parts: improving supervised learning techniques and designing active learning models. In the supervised learning part, we identify the disadvantages of existing techniques and explore promising directions for improving their classification abilities. First, we design a vector-valued SVM model (VVD) for multiclass problems. The basic idea is to separate 2^a classes with a SVM hyperplanes in the feature space induced by certain kernels; the coding idea is sketched below. This model not only reduces the computational complexity of training and testing but also eliminates the unclassifiable region (UR) problem that can degrade classification performance. Second, we propose a segment-based decision tree induction model for continuous-valued attributes. We introduce the segment of examples, which can differentiate attributes having the same class frequencies. A new hybrid scheme combining the two heuristics, segment and frequency, is then developed to expand nodes during decision tree induction. The relationship between the frequency and the expectation of the segment number, regarded as a random variable, is also given.
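As a concrete illustration of the coding idea behind VVD, the following minimal Python sketch separates 2^a classes with a binary SVMs by giving each class a distinct a-bit code. It is a sketch under stated assumptions rather than the thesis's exact construction: scikit-learn's SVC stands in for the kernel SVM, labels are assumed to be integers in [0, 2^a), and the helper names fit_bit_svms and predict_classes are hypothetical.

    import numpy as np
    from sklearn.svm import SVC  # standard binary kernel SVM

    def fit_bit_svms(X, y, a):
        # Hypothetical helper: train `a` binary SVMs, one per bit of the
        # integer class index y in [0, 2**a). Each hyperplane decides one
        # code bit, so the a hyperplanes jointly separate 2**a classes and
        # every point falls into exactly one code region, leaving no
        # unclassifiable region.
        return [SVC(kernel="rbf").fit(X, (y >> bit) & 1) for bit in range(a)]

    def predict_classes(svms, X):
        # Decode the predicted bits back into a class index in [0, 2**a).
        bits = np.stack([clf.predict(X) for clf in svms], axis=1)
        return bits @ (2 ** np.arange(len(svms)))

Note that this decodes a test point directly from its a predicted bits, which is why no region of the feature space is left unclassified, in contrast to one-versus-one or one-versus-rest voting schemes.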
Third, we analyze the approximation error of ELM based on the random weight matrix. By analyzing the dimension-increasing process in ELM, we give an approximate relation between the uniformities before and after the linear transformation. Furthermore, by restricting ELM to a two-dimensional space, we give an upper bound on the ELM approximation error that depends on the distributive uniformity of the training samples. These analytic results provide useful guidelines for improving ELM prediction accuracy.

In the active learning part, we explore different techniques for designing sample selection criteria. First, we develop an inconsistency-based strategy under the guidance of two classical works: the learning philosophy of the query-by-committee (QBC) algorithm and the traditional concept learning model of from-general-to-specific (GS) ordering. By constructing two extreme hypotheses of the current version space in each iteration, a GS learning structure is formed. It evaluates unlabeled examples with a new sample selection criterion, the inconsistency value, and the whole learning process can be implemented without any additional knowledge. The model is shown to be effective on benchmark datasets, noisy data, handwritten digit recognition, and content-based image retrieval tasks.

Second, we design a fuzzy-rough-set-based active learning model, which measures the informativeness of unlabeled samples via the inconsistency between conditional features and decision labels. By forming a sample covering system from the lower approximations in fuzzy rough sets, the similarities between labeled and unlabeled samples can be discovered. The memberships of unlabeled samples in each decision class are then derived, and each observed sample is decided to be queried or not. This model is also shown to be effective on benchmark datasets and the handwritten digit recognition task.

Finally, we propose an active learning framework based on multi-criteria decision making (MCDM) systems, considering that integrating multiple criteria can outperform any single one. By fixing the relation between every pair of unlabeled samples, a preference preorder is generated for each criterion. The dominated and dominating indices of the unlabeled samples are then calculated, and the least dominated and most dominating sample is treated as the most informative one and is the most likely to be queried. This model improves both generalization capability and learning efficiency in the multiple-instance learning (MIL) environment.
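As a rough illustration of the dominance-based selection step in the MCDM framework, the following Python sketch computes dominated and dominating indices from per-criterion scores and queries the least dominated, most dominating sample. It assumes the preference preorders are induced by numeric informativeness scores compared by Pareto dominance; the thesis's actual preorder construction and index definitions may differ, and select_query is a hypothetical name.

    import numpy as np

    def select_query(scores):
        # `scores` is a hypothetical (n_samples, n_criteria) array in which
        # a larger value means "more informative" under that criterion; it
        # stands in for the per-criterion preference preorders.
        n = scores.shape[0]
        dominating = np.zeros(n, dtype=int)  # how many samples each one dominates
        dominated = np.zeros(n, dtype=int)   # by how many samples each is dominated
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # i dominates j: at least as good on every criterion and
                # strictly better on at least one (Pareto dominance).
                if np.all(scores[i] >= scores[j]) and np.any(scores[i] > scores[j]):
                    dominating[i] += 1
                    dominated[j] += 1
        # Query the least dominated and most dominating sample.
        return int(np.argmax(dominating - dominated))

Combining the two indices into a single ranking is one simple way to realize "least dominated and most dominating"; other aggregations over the criteria would fit the same framework.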

Research areas

  • Supervised learning (Machine learning)