Active Learning Sampling Strategies

Hardik Dave
13 min read · May 29, 2020

In this article, we will first understand the core concepts behind Active Learning and then we will look into sampling strategies used in Active Learning.

Photo by Franki Chamaki on Unsplash

What is Active Learning?

Active Learning is a machine learning technique in which the learning algorithm specifically looks for the data that is most informative to the model, instead of being trained on the whole dataset.

In a fully supervised learning approach, the model needs a large amount of labeled data to train on. But what if we only have a large pool of unlabeled data?

To apply supervised machine learning algorithms to this large pool of unlabeled data, we would need to label it all, which means a human expert would have to label every data instance. That process is too expensive and time-consuming.

Here, Active Learning comes to the rescue. The basic idea is that we label only a few data samples for training and still achieve high accuracy. Active learning selects a few specific unlabeled data samples to be labeled by a human expert, and the model is then trained on those samples. The aim is that, even though only a small subset of the dataset is labeled and used for training, the model achieves accuracy/performance similar to, or sometimes better than, a model trained on the whole labeled dataset.

This process is often implemented iteratively: a small batch of samples is repeatedly selected from the unlabeled pool for labeling.

We can consider active learning a semi-supervised learning approach. A fully supervised learning algorithm uses a fully labeled dataset for training, whereas an unsupervised learning algorithm considers only unlabeled data. The active learning approach does not require the whole dataset to be labeled for training, so it can be viewed as a semi-supervised method.

Image Credit (Settles, 2009)

The above figure is a pictorial representation of the pool-based active learning process, sourced from a survey paper on active learning. Here the human expert is termed an “oracle”.

A binary classification example makes it easy to understand and visualize why we select only a few specific data samples for labeling.

Image Credit (Settles, 2009)

In the above figure, (a) represents a large pool of unlabeled data. If we randomly select some samples for labeling and training, the algorithm will predict the decision boundary shown in (b). If we instead employ an active learning sampling strategy to select the data samples for labeling and training, the algorithm will select only the data samples closest to the actual decision boundary, disregarding irrelevant samples, as shown in (c).

The data samples selected in (c) are the ones that separate the two classes and are therefore the most informative and useful samples for the model to learn from. The active learning technique aims to select this type of informative data sample, which improves the learning of the model and enhances its discriminative and predictive ability.

According to the literature on active learning, there are three different learning scenarios for the active learner:

  1. Membership Query Synthesis: In this, the active learning algorithm generates a new unlabeled instance within the input space and queries the oracle for labeling.
  2. Stream-based Selective Sampling: In stream-based selective sampling, unlabeled data samples are continuously sent from the data source to the active learner. The active learner has to decide, based on a query strategy, whether or not to ask the oracle to label the current data sample. The distribution of the data coming from the source may change over time.
  3. Pool-based Sampling: In this scenario, the data samples to be labeled by the oracle are chosen from a pool of unlabeled data samples. In contrast to stream-based selective sampling, we can consider more than one data sample at a time. The most informative data samples are selected from the pool based on a sampling strategy or informativeness measure. This is the most common learning setting for active learning algorithms (a minimal sketch of this loop is shown below).
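
To make the pool-based setting concrete, here is a minimal sketch of such a loop in Python using NumPy and scikit-learn. The function and parameter names, and the `oracle_label` callback standing in for the human expert, are illustrative choices of mine and not part of the cited papers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pool_based_active_learning(X_labeled, y_labeled, X_pool, oracle_label,
                               n_iterations=10, batch_size=5):
    """Minimal pool-based active learning loop (illustrative only).

    `oracle_label` stands in for the human expert: given the queried
    samples, it returns their labels.
    """
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_iterations):
        model.fit(X_labeled, y_labeled)

        # Score every remaining pool sample; here: least-confidence uncertainty.
        probs = model.predict_proba(X_pool)
        uncertainty = 1.0 - probs.max(axis=1)

        # Ask the oracle to label the most uncertain samples.
        query_idx = np.argsort(uncertainty)[-batch_size:]
        new_labels = oracle_label(X_pool[query_idx])

        # Move the newly labeled samples from the pool to the training set.
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, new_labels])
        X_pool = np.delete(X_pool, query_idx, axis=0)
    return model
```

The `uncertainty` line is where the sampling strategies described in the rest of this article plug in.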

We have seen that the active learning algorithm selects a few specific data samples for labeling from the pool of unlabeled data. So what are these “specific” data samples, and how do we select them?

The selection of specific data samples is driven by active learning sampling strategies, which we will look at now. These strategies are designed to pick the most useful and informative data samples, i.e. the ones the model can learn the most from.

These sampling strategies are categorized as shown below:

  1. Committee-based Strategies
  2. Large margin-based Strategies
  3. Posterior probability-based Strategies

Committee-based Strategies

In committee-based strategies, we build several models and use their predictions to determine the most informative data samples to be labeled. The set of models is referred to as a committee, and the committee votes on each unlabeled data sample. If we have n different models in the committee, we get n predictions for each data sample. A data sample is considered most informative when there is maximum disagreement among the committee’s predictions. The disagreement within the committee can be measured by the functions given below:

  • Entropy
  • KL-Divergence

The entropy measure is part of uncertainty sampling: we query the data samples for which the models are least confident, or least certain, about the label.

Image Credit (Settles, 2009)

In the equation above, V(yi) is the number of votes a particular label yi receives from the committee, and the yi are the possible labels. Entropy is explained in more detail in a later section. KL-Divergence measures the difference between two probability distributions.
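
As an illustration, the two disagreement measures could be computed roughly as follows. This is a NumPy sketch with my own function names; the input formats (`votes` as integer labels, `member_probs` as per-member class probabilities) are assumptions, not taken from the survey:

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """Vote entropy for one data sample.

    `votes` is a 1-D array of hard label predictions (one per committee
    member, encoded as integers 0..n_classes-1). Higher entropy means
    more disagreement.
    """
    p = np.bincount(votes, minlength=n_classes) / len(votes)
    p = p[p > 0]                      # ignore labels that received no votes
    return -np.sum(p * np.log(p))

def mean_kl_divergence(member_probs):
    """Average KL divergence between each member's predictive distribution
    and the committee's mean ("consensus") distribution, for one sample.

    `member_probs` has shape (n_members, n_classes).
    """
    consensus = member_probs.mean(axis=0)
    eps = 1e-12                       # avoid log(0)
    return np.mean(np.sum(member_probs *
                          np.log((member_probs + eps) / (consensus + eps)),
                          axis=1))
```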

Three commonly used methods in this category, compiled from several papers, are presented here. Various ensemble approaches can be used to build the committee, and the methods below differ mainly in how the committee is constructed:

1. Query by Committee (QBC)

QBC is the most direct realization of the committee-based idea. We train several distinct models on the same dataset; these models can have different structures, different hyperparameters, or be different algorithms altogether. For example, one model can be an SVM, the second a Decision Tree, the third a Logistic Regression, and so on. Among this committee of models, we then measure the disagreement in the predictions for a particular data sample.

The active learner decides to query the oracle for a data sample’s label if that sample produces the most disagreement among the predictions. The disagreement can be measured by the methods shown above (entropy and KL-Divergence). QBC can also be used for regression models, where disagreement refers to the “variance” among the committee members’ output predictions: high variance indicates the most informative data samples.
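
A hedged sketch of QBC with a heterogeneous committee, using scikit-learn models and the vote-entropy measure from above. The function name, committee composition, and the assumption of integer class labels 0..n_classes-1 are mine:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def qbc_query(X_labeled, y_labeled, X_pool, n_classes, n_queries=5):
    """Query-by-Committee with a committee of different algorithms.

    Returns the indices of the pool samples with the highest vote entropy.
    """
    committee = [SVC(),
                 DecisionTreeClassifier(),
                 LogisticRegression(max_iter=1000)]
    for member in committee:
        member.fit(X_labeled, y_labeled)

    # Hard-label votes from every member: shape (n_members, n_pool).
    votes = np.array([member.predict(X_pool) for member in committee]).astype(int)

    # Vote entropy per pool sample.
    disagreement = np.empty(X_pool.shape[0])
    for i in range(X_pool.shape[0]):
        p = np.bincount(votes[:, i], minlength=n_classes) / votes.shape[0]
        p = p[p > 0]
        disagreement[i] = -np.sum(p * np.log(p))

    return np.argsort(disagreement)[-n_queries:]
```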

2. Entropy-based Query by Bagging (EQB)

In the EQB approach, the committee is constructed differently. Here, k training sets are created from the original dataset using the “draw with replacement” (bootstrap) policy. These k subsets of the original dataset are fed to k different models for training, and the trained models are then used to make predictions on the unlabeled pool of data samples.

For every unlabeled data sample we thus obtain k labels, one predicted by each of the k models. These predictions can be viewed as an empirical probability distribution over the classes, so the heuristic used to measure disagreement in this approach is entropy. An extension of this approach normalizes the entropy value by the number of classes.
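
A rough EQB sketch under the same assumptions (bootstrap replicates of the labeled set, integer class labels, and decision trees as an arbitrary base model of my choosing):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def eqb_query(X_labeled, y_labeled, X_pool, n_classes,
              k=10, n_queries=5, normalize=False, random_state=0):
    """Entropy-based Query by Bagging (illustrative sketch).

    Builds k bootstrap replicates ("draw with replacement") of the labeled
    set, trains one model per replicate, and ranks pool samples by the
    entropy of the k predicted labels.
    """
    rng = np.random.default_rng(random_state)
    n = X_labeled.shape[0]

    votes = np.empty((k, X_pool.shape[0]), dtype=int)
    for j in range(k):
        idx = rng.integers(0, n, size=n)            # bootstrap replicate
        model = DecisionTreeClassifier(random_state=j)
        model.fit(X_labeled[idx], y_labeled[idx])
        votes[j] = model.predict(X_pool)

    entropy = np.empty(X_pool.shape[0])
    for i in range(X_pool.shape[0]):
        p = np.bincount(votes[:, i], minlength=n_classes) / k
        p = p[p > 0]
        entropy[i] = -np.sum(p * np.log(p))

    if normalize:                                   # normalized EQB variant
        entropy /= np.log(n_classes)

    return np.argsort(entropy)[-n_queries:]
```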

3. Adaptive Maximum Disagreement (AMD)

This strategy constructs the committee by splitting the feature space and providing each model with a different subset of features. The models are thus trained on the same samples but on distinct features, which introduces more diversity in their discriminative behavior. The measure of disagreement used here is the same as in the previous strategies.
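
An illustrative way to build such a feature-split committee. The random, disjoint split into `n_views` views is a simplifying assumption of mine; the original AMD work chooses the feature splits more carefully:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def amd_committee(X_labeled, y_labeled, n_views=4, random_state=0):
    """Committee built by splitting the feature space into disjoint views.

    Each member sees a different subset of features; disagreement between
    the members can then be measured exactly as in QBC/EQB.
    """
    rng = np.random.default_rng(random_state)
    n_features = X_labeled.shape[1]
    feature_views = np.array_split(rng.permutation(n_features), n_views)

    committee = []
    for view in feature_views:
        model = LogisticRegression(max_iter=1000)
        model.fit(X_labeled[:, view], y_labeled)
        committee.append((view, model))
    return committee

def amd_votes(committee, X_pool):
    """Hard-label votes of the feature-view committee on the pool,
    shape (n_views, n_pool)."""
    return np.array([model.predict(X_pool[:, view])
                     for view, model in committee])
```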

Another less-known strategy for voting in the committee is “weight-based voting” or “weighted voting”. Each model’s prediction is associated with a weight that represents how much that model’s prediction contributes to the overall performance of the active learner. These weights are determined from the model’s accuracy or other performance metrics.

Large margin-based Strategies

Large margin-based strategies are specifically designed for margin-based classifiers such as SVM. The distance to the separating hyperplane is a good metric of the model’s confidence, or certainty, on unlabeled data samples. In SVM, the support vectors are the data samples that define the decision boundary, so for an active learner the most informative data samples are the probable new support vectors. Several strategies in this category can also be adapted to probability-based models. These strategies fall under uncertainty sampling as well.

1. Margin Sampling (MS)

From the mathematical properties of the SVM algorithm, the support vectors are the labeled data samples that lie on the margin, at a (functional) distance of exactly 1 from the decision boundary. The Margin Sampling strategy assumes that the most informative data samples are those that fall within this margin, since they are the most likely to become new support vectors.

Image Credit (D. Tuia et al., 2011)

In the equation above, f(xi, w) represents the distance of the data sample xi from the hyperplane for class w. One drawback of this method is that it selects only a single data sample per iteration to query the oracle, which increases the overall execution time.
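
A sketch of Margin Sampling with scikit-learn’s SVC. Note that `decision_function` returns values proportional to the signed distance from the hyperplane, which is enough for ranking samples; the function name and parameters are my own:

```python
import numpy as np
from sklearn.svm import SVC

def margin_sampling_query(X_labeled, y_labeled, X_pool):
    """Margin Sampling for an SVM (illustrative sketch).

    Selects the single pool sample closest to the separating hyperplane,
    i.e. the one most likely to become a new support vector.
    """
    model = SVC(kernel="rbf")
    model.fit(X_labeled, y_labeled)

    # Decision values: shape (n_pool,) for binary problems,
    # (n_pool, n_classes) for multiclass with the default "ovr" shape.
    dist = model.decision_function(X_pool)
    if dist.ndim > 1:
        dist = np.abs(dist).min(axis=1)   # closest per-class hyperplane
    else:
        dist = np.abs(dist)

    return int(np.argmin(dist))           # MS queries one sample per iteration
```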

2. Multiclass Level Uncertainty (MCLU)

The Margin Sampling method considers the data sample for which the prediction on one of the classes is most uncertain, i.e. the data sample with the minimum distance from the hyperplane. The MCLU strategy, in contrast, focuses on the multi-class setting. For each data sample, the distance to the hyperplane is computed for every class, and the classes with the largest and second-largest distances are identified. MCLU considers the difference between these two distances.

Image Credit (D. Tuia et al., 2011)
Image Credit (D. Tuia et al., 2011)

The equations above present the MCLU strategy in mathematical form. The second equation is the difference between the two largest distances, which belong to two different classes. The goal is to minimize this difference, so that we obtain the most uncertain, and thus most informative, unlabeled data samples for training.
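
A sketch of MCLU for a multiclass SVM, again using `decision_function` values as a stand-in for the per-class distances (this assumes a multiclass problem with a one-vs-rest decision function; names and parameters are mine):

```python
import numpy as np
from sklearn.svm import SVC

def mclu_query(X_labeled, y_labeled, X_pool, n_queries=5):
    """Multiclass Level Uncertainty (illustrative sketch).

    For every pool sample, take the difference between the largest and
    second-largest per-class decision values and pick the samples where
    that difference is smallest.
    """
    model = SVC(kernel="rbf", decision_function_shape="ovr")
    model.fit(X_labeled, y_labeled)

    # Per-class decision values: shape (n_pool, n_classes).
    dist = model.decision_function(X_pool)
    dist_sorted = np.sort(dist, axis=1)
    best, second_best = dist_sorted[:, -1], dist_sorted[:, -2]

    mclu = best - second_best            # small difference = high uncertainty
    return np.argsort(mclu)[:n_queries]
```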

3. Margin Sampling-closest Support Vectors (MS-cSV)

MS-cSV was designed to eliminate the drawback of Margin Sampling. This strategy first stores the position of each data sample relative to the support vectors; this information is then used to select the most informative data samples. We select one data sample per support vector: for each support vector, the data sample with the lowest distance to it is chosen. This way, more than one unlabeled data sample can be queried in every iteration, removing the drawback of MS.
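
An illustrative MS-cSV sketch. Computing the distances in input space is a simplification of mine; whether the distances are evaluated in input or kernel space is a detail I gloss over here:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import pairwise_distances

def ms_csv_query(X_labeled, y_labeled, X_pool):
    """Margin Sampling - closest Support Vectors (illustrative sketch).

    For every support vector of the current SVM, pick the unlabeled pool
    sample that lies closest to it, so that several samples can be
    queried in a single iteration.
    """
    model = SVC(kernel="rbf")
    model.fit(X_labeled, y_labeled)

    # Distance of every pool sample to every support vector:
    # shape (n_pool, n_support_vectors).
    dist = pairwise_distances(X_pool, model.support_vectors_)

    # One candidate per support vector: the nearest pool sample.
    query_idx = np.unique(dist.argmin(axis=0))
    return query_idx
```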

Posterior probability-based Strategies

This category of sampling strategies is based on the estimation of class membership probabilities. The posterior probability distribution is evaluated to decide whether an unlabeled data sample should be queried for a label. Unlike large margin-based strategies, these strategies can be used with any model that can predict output probabilities for class membership. The posterior probability distribution indicates the model’s confidence, or certainty, in assigning a particular class to a data sample. The strategies that fall under this category are explained below.

1. Least Confidence (LC)

This strategy lets the active learner select the unlabeled data samples for which the model is least confident in its prediction or class assignment. For example, in binary classification, if the model predicts a probability of 0.9 for class 0 and therefore 0.1 for class 1, the model is confident about the class membership of that data sample. But what if the model predicts a probability of 0.5 for both classes? Then it cannot confidently assign a class, because it is not certain about the class membership. The active learner aims to select exactly these most informative and uncertain data samples, for which the model has low confidence in its prediction; such samples improve the model’s discriminative ability.

Image Credit (Settles, 2009)

In this equation, ŷ is the class with the highest posterior probability under model 𝜃. Subtracting that probability from 1 and maximizing the result gives us the data sample with the highest uncertainty.
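
A minimal Least Confidence sketch that works with any fitted scikit-learn-style classifier exposing `predict_proba` (the function name and `n_queries` parameter are my own):

```python
import numpy as np

def least_confidence_query(model, X_pool, n_queries=5):
    """Least Confidence sampling (illustrative sketch).

    `model` is any fitted classifier exposing `predict_proba`; we select
    the samples whose top-class posterior probability is lowest, i.e.
    where 1 - P(y_hat | x) is largest.
    """
    probs = model.predict_proba(X_pool)
    uncertainty = 1.0 - probs.max(axis=1)   # 1 minus posterior of top class
    return np.argsort(uncertainty)[-n_queries:]
```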

2. Best-versus-Second-Best (BvSB) or Smallest Margin

This strategy is similar to Margin Sampling among the large margin-based strategies; the difference is that it applies to posterior probabilities, whereas MS applies mainly to margin-based models. The least confidence strategy considers only the most probable class, while this strategy considers the two most probable classes, i.e. the classes with the highest and second-highest probabilities.

Image Credit (Settles, 2009)

Here, ŷ₁ and ŷ₂ are the classes with the highest and second-highest probabilities predicted by model 𝜃. We take the argmin of this expression because a small difference between the two most probable classes indicates that the model is uncertain about data sample x; the lowest difference therefore identifies the most uncertain and informative sample.

Consider a first case where the two most probable class predictions are 0.5 and 0.4, and a second case where they are 0.8 and 0.1 (the probabilities must sum to 1, so we assume the remaining 0.1 in each case is distributed among the other classes). In the second case the model is quite confident (the difference is 0.8 − 0.1 = 0.7), whereas in the first case it is least certain (the difference is 0.5 − 0.4 = 0.1). Minimizing the difference between the two most probable classes is therefore a natural choice.
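
The corresponding BvSB / smallest-margin sketch, under the same assumptions as the Least Confidence example:

```python
import numpy as np

def bvsb_query(model, X_pool, n_queries=5):
    """Best-versus-Second-Best / smallest margin sampling (sketch).

    Selects the samples with the smallest gap between the two most
    probable classes.
    """
    probs = model.predict_proba(X_pool)
    probs_sorted = np.sort(probs, axis=1)
    margin = probs_sorted[:, -1] - probs_sorted[:, -2]   # best minus second best
    return np.argsort(margin)[:n_queries]                # smallest margins first
```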

3. Entropy

For classification with a large number of classes, the two posterior probability-based strategies above are not optimal, because they consider only the one or two most probable classes; the information in the rest of the probability distribution is unused. To address this, the entropy measure is used for sampling. In a general sense, entropy is a measure of disorder or impurity in a system; in machine learning, it is widely used as a measure of a model’s uncertainty. A higher entropy value indicates that the model is more uncertain about the class membership.

Image Credit (Settles, 2009)

This is the equation for entropy. Maximizing this function gives us the data sample for which the model is least certain, i.e. most uncertain, about the class assignment.
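
And an entropy-sampling sketch under the same assumptions, using the full posterior distribution over classes:

```python
import numpy as np

def entropy_query(model, X_pool, n_queries=5):
    """Entropy-based uncertainty sampling (sketch).

    Uses the full posterior distribution over classes rather than only
    the top one or two; higher entropy means higher uncertainty.
    """
    probs = model.predict_proba(X_pool)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # avoid log(0)
    return np.argsort(entropy)[-n_queries:]
```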

There are some other sampling strategies as well, briefly described below. You can always refer to the classic active learning literature to learn more about them.

Expected Model Change

In this strategy, the active learner focuses on the data samples that would cause the greatest change to the model if their labels were known. The change in the model can be measured by the length of the training gradient (the vector of parameter updates).

Expected Error Reduction

The active learner tries to select those data samples for labeling which would most reduce the error in the next iteration. The expected future error of the model, retrained with a candidate sample included, is estimated over the remaining unlabeled pool, and the sample with the lowest estimated error is selected.

Variance Reduction

This strategy selects the data samples whose labeling would most reduce the model’s output variance. Variance is one of the components of the generalization error, so minimizing the variance indirectly minimizes the generalization error as well.

A significant amount of research is still going on in the field of Active Learning. Research in active learning is often domain-dependent: some strategies work well across fields while others do not. Researchers are therefore trying to design novel sampling strategies and to generalize the well-performing ones. Active learning can be a cost-effective solution for labeling samples when a large pool of unlabeled data is available. The references below list the literature used for this article, which is well worth reading.

Thank you for reading. I hope you have enjoyed this article and gained knowledge from it.

References:

[1] B. Settles, Active Learning Literature Survey (2009), Computer Sciences Technical Report 1648, University of Wisconsin–Madison

[2] D. Tuia, M. Volpi, L. Copa, M. Kanevski and J. Munoz-Mari, A Survey of Active Learning Algorithms for Supervised Remote Sensing Image Classification (2011), IEEE Journal of Selected Topics in Signal Processing

[3] D. Tuia, F. Ratle, F. Pacifici, M. F. Kanevski and W. J. Emery, Active Learning Methods for Remote Sensing Image Classification (2009), IEEE Transactions on Geoscience and Remote Sensing
