Active Learning under Complex Data Scenarios


Student thesis: Doctoral Thesis



Supervisors
  • Antoni B. Chan (Supervisor)
  • Qing Li (External Co-Supervisor)

Award date: 23 Feb 2023


In real-world applications, obtaining large amounts of unlabeled data is relatively easy, but obtaining their labels is difficult because the labeling process is expensive and time-consuming. Most machine learning (ML) applications, especially deep learning (DL) models, are data-hungry and usually rely on extensive labeled data to deliver good model performance. Active learning (AL) is proposed to solve this label-insufficiency problem: it iteratively optimizes the basic learned model(s) by selecting and annotating the unlabeled data samples deemed to best maximize model performance with the least amount of labeled data.
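As a concrete illustration of the query step described above, the following is a minimal sketch (not the thesis's own implementation) of single-criterion uncertainty sampling: given the current model's predicted class probabilities over the unlabeled pool, select the samples with the highest predictive entropy for annotation.

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Select the k pool indices with the highest predictive entropy."""
    eps = 1e-12  # avoid log(0)
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    return np.argsort(-entropy)[:k]

# Toy pool: predicted class probabilities for 5 unlabeled samples.
probs = np.array([
    [0.90, 0.10],   # fairly confident
    [0.50, 0.50],   # maximally uncertain
    [0.60, 0.40],
    [0.99, 0.01],   # very confident
    [0.55, 0.45],
])
picked = uncertainty_sample(probs, 2)  # → indices 1 and 4
```

In a full AL loop, the selected samples would then be sent to an annotator, added to the labeled set, and the model retrained before the next round.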

Two main criteria are widely adopted in constructing AL querying strategies: informativeness- and representativeness-based measures. These single-criterion measures are simple and efficient but lack the adaptability needed for real-life data scenarios. Multiple-criteria-based active learning algorithms, which incorporate complementary criteria (informativeness, representativeness, and diversity), are needed to make appropriate selections in the active learning rounds across different data types. We cast the selection process in AL as a Determinantal Point Process (DPP), which strikes a good balance among these criteria. We further refine the query selection strategy by selecting the unlabeled samples that are hardest to classify and biasing toward the classifiers best suited to the current data distribution.
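To make the DPP view concrete, here is a minimal greedy MAP sketch (an illustration, not the thesis's exact algorithm): with an L-ensemble kernel L = diag(q) S diag(q), where q encodes per-sample informativeness and S is a feature similarity matrix, subsets of informative yet mutually dissimilar samples get high determinant, so greedily maximizing the determinant trades off informativeness against diversity.

```python
import numpy as np

def greedy_dpp_select(quality, features, k):
    """Greedy MAP selection under an L-ensemble DPP.

    L = diag(q) @ S @ diag(q): quality q rewards informative samples,
    while similarity S penalizes redundant (non-diverse) picks.
    """
    # Cosine similarity between candidate feature vectors.
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    S = X @ X.T
    L = np.outer(quality, quality) * S
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(quality)):
            if i in selected:
                continue
            idx = selected + [i]
            # Determinant of the kernel restricted to the candidate subset.
            gain = np.linalg.det(L[np.ix_(idx, idx)])
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return selected

# Items 0 and 2 are informative but nearly identical; item 1 is distinct.
quality = np.array([1.0, 0.9, 0.95])
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
selected = greedy_dpp_select(quality, features, 2)  # → [0, 1]
```

Note that the slightly less informative but diverse item 1 is chosen over the redundant item 2, which is exactly the balance the DPP formulation provides.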

Current AL sampling strategies have mostly been tested on relatively simple, clean, and well-studied datasets and tasks. In real-life applications, however, the data scenarios are far more complex. For instance, when collecting unlabeled data samples, unrelated samples (e.g., out-of-distribution data) may be mixed in with the task-related data, and most AL sampling schemes are not robust to such scenarios. Achieving good AL performance under complex data scenarios (e.g., out-of-distribution data) is therefore challenging. To deal with AL under an out-of-distribution data dilemma, we designed a Monte-Carlo Pareto Optimization for Active Learning framework, which selects optimal fixed-size subsets of unlabeled samples from the unlabeled data pool. We cast the AL sampling task as a multi-objective optimization problem and apply Pareto optimization to two conflicting objectives: (1) the typical AL data sampling criterion (e.g., maximum entropy), and (2) the confidence that a sample is not out-of-distribution.

When the focus on AL performance under complex data scenarios turns to the AL sampling schemes themselves, we find that the whole AL sampling and training process suffers from bias. Many existing AL works design acquisition functions based on fixed heuristics for data collection; these fixed heuristics inevitably introduce "sampling bias" throughout the AL process. Given a labeled training set collected with sampling bias, an estimator of a basic model that is unbiased and consistent under passive learning may no longer be unbiased, even asymptotically. We explore the relationship between the data-collection and model-fitting stages in AL and discuss the factors crucial to designing AL approaches that reduce the negative effects of sampling bias. We propose a flexible AL framework that can be applied on top of existing AL sampling schemes by minimizing the generalization error and a re-weighted training loss at each stage.
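One standard way to correct sampling bias of this kind, shown here as a hedged sketch rather than the thesis's exact re-weighting scheme, is inverse-propensity weighting: if sample i was queried with probability q_i, weighting its loss by 1/q_i recovers an unbiased estimate of the loss under the original pool distribution.

```python
import numpy as np

def reweighted_loss(losses, sample_probs):
    """Inverse-propensity re-weighted training loss.

    losses: per-sample losses on the actively collected labeled set.
    sample_probs: probability with which each sample was queried.
    Weighting by 1/q_i de-biases the estimate w.r.t. the pool distribution.
    """
    weights = 1.0 / np.asarray(sample_probs)
    return float(np.mean(weights * np.asarray(losses)))

# A rarely queried sample (q = 0.5) gets double weight.
loss = reweighted_loss([0.2, 0.4], [0.5, 1.0])  # → 0.4
```

Minimizing this re-weighted loss in the model-fitting stage counteracts the skew introduced by the fixed acquisition heuristic in the data-collection stage.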

Although many AL methods have been developed for classical ML and DL tasks, some essential questions remain unanswered, such as how to: 1) determine the current state-of-the-art techniques; 2) evaluate the relative benefit of new methods across datasets with different properties; 3) understand which specific problems merit greater attention; and 4) measure the progress of the field over time. We propose benchmarking pool-based AL methods on a variety of datasets with quantitative metrics and draw insights from the comparative empirical results. Our comparative experiments cover two branches: pool-based AL for classical ML tasks and for DL tasks. In particular, we construct a toolbox for deep active learning, called DeepAL+, by re-implementing 19 highly cited deep active learning methods. We hope that our benchmark and toolbox can provide an authentic comparative evaluation for pool-based AL, a quick view of which AL methods are most effective along with the open challenges and possible research directions in AL, and guidelines for conducting fair comparative experiments on future AL methods.
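A fair benchmark of pool-based AL strategies boils down to running each strategy through the same query loop on the same pool; the sketch below illustrates that harness with two classical uncertainty strategies (least-confidence and margin). It is a simplified illustration, not the DeepAL+ API.

```python
import numpy as np

def run_round(probs, labeled_mask, strategy, batch_size):
    """One AL round: score the unlabeled pool with `strategy` and move
    the top-scoring batch into the labeled set."""
    scores = strategy(probs).astype(float)
    scores[labeled_mask] = -np.inf          # never re-query labeled samples
    picked = np.argsort(-scores)[:batch_size]
    new_mask = labeled_mask.copy()
    new_mask[picked] = True
    return picked, new_mask

def least_confidence(probs):
    return 1.0 - probs.max(axis=1)          # low top probability = high score

def margin(probs):
    s = np.sort(probs, axis=1)
    return -(s[:, -1] - s[:, -2])           # small top-2 margin = high score

# Same 5-sample pool for both strategies; sample 0 is already labeled.
probs = np.array([
    [0.90, 0.10],
    [0.50, 0.50],
    [0.60, 0.40],
    [0.99, 0.01],
    [0.55, 0.45],
])
labeled = np.zeros(len(probs), dtype=bool)
labeled[0] = True
picked_lc, _ = run_round(probs, labeled, least_confidence, 2)
picked_m, _ = run_round(probs, labeled, margin, 2)
```

Because every strategy sees an identical pool, initial labeled set, and batch size, differences in downstream model accuracy can be attributed to the acquisition function itself, which is the premise of the benchmarking study.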

    Research areas

  • Active learning, Machine Learning, Deep Learning, Supervised learning (Machine learning)