Introduction to Active Learning

What is Active Learning?

Michelle Zhao
Towards Data Science


The goal of this post is to help demystify active learning and show how it differs from standard supervised machine learning.

First, what is active learning? Active learning is a machine learning framework in which the learning algorithm can interactively query a user (teacher or oracle) to label new data points with the true labels. The process of active learning is also referred to as optimal experimental design.

The motivation for active learning is a scenario in which we have a large pool of unlabelled data. Consider the problem of training an image classification model to distinguish between cats and dogs. There are millions of images of each, but not all of them are needed to train a good model. Some images offer more clarity and information than others. A similar application is classifying the content of YouTube videos, where the data is inherently dense and abundant.

Passive learning, the standard framework in which a large quantity of labelled data is passed to the algorithm, requires significant effort in labelling the entire set of data.

Passive Machine Learning.

By using active learning, we can leverage a system like crowd-sourcing to have human experts label only a selected subset of the data, rather than the entire set. The algorithm iteratively selects the most informative examples according to some value metric and sends those unlabelled examples to a labelling oracle, which returns the true labels for the queried examples to the algorithm.

Active Machine Learning. Credit: Inspired by Yi Zhang’s active learning slides

In many cases, active learning performs better than random sampling. The image below shows a motivating example of active learning's improvement over random selection. The entire set of data points (the union of the red triangles and green circles) is not linearly separable.

Credit: Image from Burr Settles’ active learning slides.

Active learning is motivated by the understanding that not all labelled examples are equally important. With uniform random sampling over all of the examples, the learned model doesn’t quite represent the division between classes. However, active learning selects examples near the class boundary, and is able to find a more representative classifier. Previous research has also shown that active learning offers improvement over standard random selection for tasks like multi-class image classification [1, 2, 3, 4].

The active learning framework reduces data selection to the problem of determining which data points in the set are the most informative. In active learning, the most informative data points are generally the ones the model is most uncertain about, which calls for metrics to quantify and compare the uncertainty of examples.

Different Active Learning Frameworks

Active learning is often considered a semi-supervised method: it sits between unsupervised learning, which uses none of the labels, and fully supervised learning, which uses all of them. By iteratively growing the labelled training set, we can approach fully supervised performance with a fraction of the labelling cost and training time.

Pool-based Active Learning

In pool-based sampling, training examples are chosen from a large pool of unlabelled data. Selected training examples from this pool are labelled by the oracle.

Stream-based Active Learning

In stream-based active learning, the set of all training examples is presented to the algorithm as a stream. Each example is sent individually to the algorithm for consideration, and the algorithm must make an immediate decision on whether or not to query its label. If an example is selected, the oracle labels it and the label is received by the algorithm before the next example is presented.
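As a minimal sketch of this decision rule, assuming each stream item is a NumPy feature vector, the current model exposes `predict_proba`, and a hypothetical `oracle_label` callable stands in for the human annotator (the margin threshold is one simple, illustrative choice of selection criterion):

```python
import numpy as np

def stream_based_selection(model, example_stream, oracle_label,
                           margin_threshold=0.2):
    """Label an example only if the current model's best-versus-second-best
    margin falls below a threshold; otherwise discard it immediately."""
    labelled = []
    for x in example_stream:
        probs = np.sort(model.predict_proba(x.reshape(1, -1))[0])
        margin = probs[-1] - probs[-2]
        if margin < margin_threshold:
            # The model is uncertain: query the oracle and keep the label.
            labelled.append((x, oracle_label(x)))
        # Otherwise the example passes by; there is no second look.
    return labelled
```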

Uncertainty Measures

Which data points are selected as most informative depends on the uncertainty measure used for selection. In pool-based sampling, the active learning algorithm selects the most informative examples from the pool to add to the growing training set.

The most informative examples are the ones that the classifier is the least certain about.

The intuition here is that the examples for which the model has the least certainty will likely be the most difficult examples — specifically the examples that lie near the class boundaries. The learning algorithm will gain the most information about the class boundaries by observing the difficult examples.

Below are four common uncertainty measures used in active learning to select the most informative examples.

1. Smallest Margin Uncertainty

The smallest margin uncertainty is a best-versus-second-best uncertainty comparison. The smallest margin uncertainty (SMU) is the classification probability of the most likely class minus the classification probability of the second most likely class [1]. The intuition behind this metric is that if the probability of the most likely class is significantly greater than the probability of the second most likely class, then the classifier is more certain about the example’s class membership. Likewise, if the probability of the most likely class is not much greater than the probability of the second most likely class, then the classifier is less certain about the example’s class membership. The active learning algorithm will select the example with the minimum SMU value.
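As a minimal sketch, assuming `probs` is a NumPy array of shape (n_samples, n_classes) holding the classifier's predicted class probabilities, SMU might be computed as follows:

```python
import numpy as np

def smallest_margin_uncertainty(probs):
    """Best-versus-second-best margin for each example.

    probs: array of shape (n_samples, n_classes) of predicted class
    probabilities. A smaller margin means more uncertainty.
    """
    # Sort each row so the two largest probabilities sit at the end.
    sorted_probs = np.sort(probs, axis=1)
    return sorted_probs[:, -1] - sorted_probs[:, -2]

# Query the example with the minimum margin (most uncertain).
# query_idx = np.argmin(smallest_margin_uncertainty(probs))
```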

2. Least Confidence Uncertainty

Least confidence uncertainty (LCU) is selecting the example for which the classifier is least certain about the selected class. LCU selection only looks at the most likely class, and selects the example that has the lowest probability assigned to that class.
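Under the same assumption of a `probs` array of predicted class probabilities, a sketch of LCU selection:

```python
import numpy as np

def least_confidence_uncertainty(probs):
    """Probability assigned to the most likely class for each example.

    A lower top-class probability means more uncertainty.
    """
    return probs.max(axis=1)

# Query the example whose top-class probability is lowest.
# query_idx = np.argmin(least_confidence_uncertainty(probs))
```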

3. Entropy Reduction

Entropy is a measure of the uncertainty of a random variable; here we use Shannon entropy. Shannon entropy has several basic properties, including (1) uniform distributions have maximum uncertainty, (2) uncertainty is additive for independent events, (3) adding an outcome with zero probability has no effect, and (4) events with a certain outcome have zero uncertainty [6, 7]. Treating class predictions as outcomes, we can measure the Shannon entropy of the predicted class probabilities.

Higher values of entropy indicate greater uncertainty in the probability distribution [1]. In each active learning step, for every unlabelled example in the training set, the active learning algorithm computes the entropy over the predicted class probabilities, and selects the example with the highest entropy. The example with the highest entropy is the example for which the classifier is least certain about its class membership.
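A sketch of entropy-based selection over the same `probs` array (the small epsilon is an assumption added to avoid taking the log of zero):

```python
import numpy as np

def entropy_uncertainty(probs, eps=1e-12):
    """Shannon entropy of the predicted class distribution for each example.

    Higher entropy means a flatter distribution and more uncertainty.
    """
    return -np.sum(probs * np.log(probs + eps), axis=1)

# Query the example with the highest predictive entropy.
# query_idx = np.argmax(entropy_uncertainty(probs))
```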

4. Largest Margin Uncertainty

The largest margin uncertainty is a best-versus-worst uncertainty comparison [5]. The largest margin uncertainty (LMU) is the classification probability of the most likely class minus the classification probability of the least likely class. The intuition behind this metric is that if the probability of the most likely class is significantly greater than the probability of the least likely class, then the classifier is more certain about the example’s class membership. Likewise, if the probability of the most likely class is not much greater than the probability of the least likely class, then the classifier is less certain about the example’s class membership. The active learning algorithm will select the example with the minimum LMU value.
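And a corresponding sketch of LMU selection, again assuming a `probs` array of predicted class probabilities:

```python
import numpy as np

def largest_margin_uncertainty(probs):
    """Best-versus-worst margin for each example.

    A smaller gap between the most and least likely class means a
    flatter distribution, i.e. more uncertainty.
    """
    return probs.max(axis=1) - probs.min(axis=1)

# Query the example with the minimum best-versus-worst margin.
# query_idx = np.argmin(largest_margin_uncertainty(probs))
```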

Algorithm

The algorithm below outlines pool-based active learning [8]; a stream-based version can be written similarly.
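As a minimal sketch of such a pool-based loop, assuming a scikit-learn-style classifier, smallest-margin sampling as the uncertainty measure, and a hypothetical `oracle_label` callable standing in for the human annotator (the names `X_seed`, `y_seed`, and `n_queries` are illustrative, not from the original post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pool_based_active_learning(X_pool, oracle_label, X_seed, y_seed,
                               n_queries=20):
    """Pool-based active learning with smallest-margin sampling.

    X_pool:       unlabelled candidate examples, shape (n_pool, n_features)
    oracle_label: callable returning the true label for one example
    X_seed, y_seed: small initial labelled set
    """
    X_train, y_train = list(X_seed), list(y_seed)
    pool = list(range(len(X_pool)))
    model = LogisticRegression(max_iter=1000)

    for _ in range(n_queries):
        # 1. Fit the classifier on the current labelled set.
        model.fit(np.array(X_train), np.array(y_train))

        # 2. Score every remaining pool example by smallest-margin uncertainty.
        probs = model.predict_proba(X_pool[pool])
        sorted_probs = np.sort(probs, axis=1)
        margins = sorted_probs[:, -1] - sorted_probs[:, -2]

        # 3. Query the oracle for the most uncertain example and add it.
        query = pool.pop(int(np.argmin(margins)))
        X_train.append(X_pool[query])
        y_train.append(oracle_label(X_pool[query]))

    # Final model trained on the seed set plus all queried labels.
    model.fit(np.array(X_train), np.array(y_train))
    return model
```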

A principal bottleneck in large-scale classification tasks is the large number of training examples needed to train a classifier. Using active learning, we can reduce the number of labelled training examples needed by strategically selecting which examples to annotate.

References

[1] A. J. Joshi, F. Porikli and N. Papanikolopoulos, “Multi-class active learning for image classification,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 2009, pp. 2372–2379.

[2] Guo-Jun Qi, Xian-Sheng Hua, Yong Rui, Jinhui Tang and Hong-Jiang Zhang, “Two-Dimensional Active Learning for image classification,” 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, 2008, pp. 1–8.

[3] E. Y. Chang, S. Tong, K. Goh, and C. Chang, “Support vector machine concept-dependent active learning for image retrieval,” IEEE Transactions on Multimedia, 2005.

[4] A. Kapoor, K. Grauman, R. Urtasun and T. Darrell, “Active Learning with Gaussian Processes for Object Categorization,” 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, 2007, pp. 1–8.

[5] https://becominghuman.ai/accelerate-machine-learning-with-active-learning-96cea4b72fdb

[6] https://towardsdatascience.com/entropy-is-a-measure-of-uncertainty-e2c000301c2c

[7] L. M. Tiwari, S. Agrawal, S. Kapoor and A. Chauhan, “Entropy as a measure of uncertainty in queueing system,” 2011 National Postgraduate Conference, Kuala Lumpur, 2011, pp. 1–4.

[8] https://towardsdatascience.com/active-learning-tutorial-57c3398e34d
