
How Much Time Can You Save With Active Learning?

A hands-on experiment on an NLP dataset.


If you’ve ever taken part in a machine learning project starting from scratch, you probably know that a significant amount of time is usually spent on labeling the data. The quantity and quality of the labeled data often determine the course of the project, as well as its final outcome.

A traditional way of choosing which data to label is to simply take a random sample of whatever size you have the labeling capacity for.

In this article, I would like to explore a more effective approach to choosing data for labeling, called Active Learning, based on the idea that some data points bring more information value to the model than others.

Photo by NeONBRAND on Unsplash

Active learning

The main principle of active learning is to let the model choose which data instances should be labeled by the human annotator (often referred to as an oracle), which results in better performance with fewer labeled instances overall.

The most common scenario for letting the model do this is called pool-based sampling, where a large pool of unlabeled data is available. From this pool, the model selects the instances to be labeled according to a certain query strategy. After the queried instances are labeled, the model is retrained with the newly added data, and the process repeats.
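To make this loop concrete, here is a minimal sketch of pool-based sampling in Python. It is not the code from the experiment: it assumes `pool_X` and `pool_y` are NumPy arrays, that the model exposes `fit` and a `predict` that returns class probabilities (as a softmax classifier would), and that `query_strategy` is a function returning the indices of the instances to label next. Since the labels in this experiment already exist in the dataset, "asking the oracle" is just a lookup in `pool_y` standing in for the human annotator.

```python
import numpy as np

def active_learning_loop(model, seed_X, seed_y, pool_X, pool_y,
                         query_strategy, batch_size, n_iterations):
    """Pool-based active learning loop (sketch).

    `query_strategy` takes a matrix of predicted class probabilities and a
    batch size, and returns the indices of the instances to query next.
    """
    labeled_X, labeled_y = list(seed_X), list(seed_y)
    unlabeled = list(range(len(pool_X)))               # indices still in the pool

    model.fit(np.array(labeled_X), np.array(labeled_y))   # initial training on the seed

    for _ in range(n_iterations):
        probs = model.predict(pool_X[unlabeled])           # predicted class probabilities
        picked = query_strategy(probs, batch_size)         # most informative instances
        queried = [unlabeled[i] for i in picked]

        for idx in queried:                                # "ask the oracle" for labels
            labeled_X.append(pool_X[idx])
            labeled_y.append(pool_y[idx])
        unlabeled = [idx for idx in unlabeled if idx not in queried]

        model.fit(np.array(labeled_X), np.array(labeled_y))  # retrain with the new labels
    return model
```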

Probably the simplest query strategy, and the one that I used for the experiment, is uncertainty sampling. This strategy queries the instances for which the model’s predictions are most uncertain. If we imagine a binary classification problem, these would be the instances whose posterior probability is closest to 0.5.

In a multi-class classification problem, there are several ways of doing uncertainty sampling (a minimal sketch of each follows the list):

Least confidence sampling selects the instances for which the most likely label has the least confidence (its probability is furthest from 1.0).

Margin sampling selects the instances with the smallest difference between the most probable and the second most probable label.

Entropy sampling selects the instances with the highest entropy of the predicted class distribution (computed from the probabilities using [this formula](https://en.wikipedia.org/wiki/Entropy_(information_theory))).
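For illustration, the three strategies above could be implemented roughly as follows, as functions that take a matrix of predicted class probabilities (one row per instance) and return the indices of the k most uncertain instances. The function names and signatures are mine, not from any particular library.

```python
import numpy as np

def least_confidence(probs, k):
    # Lowest probability of the most likely class = most uncertain.
    scores = 1.0 - probs.max(axis=1)
    return np.argsort(scores)[-k:]

def margin(probs, k):
    # Smallest gap between the two most probable classes = most uncertain.
    sorted_probs = np.sort(probs, axis=1)
    scores = sorted_probs[:, -1] - sorted_probs[:, -2]
    return np.argsort(scores)[:k]

def entropy(probs, k):
    # Highest entropy of the predicted distribution = most uncertain.
    scores = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(scores)[-k:]
```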


Experiment

I decided to run an experiment comparing all three multi-class approaches, together with basic random sampling as a baseline. For this, I used the DBPedia Classes Dataset, a common benchmark dataset for multi-class text classification.

From the dataset, I randomly selected 5000 instances to serve as our pool of unlabeled data. This pool was then queried using the different strategies and labeled afterwards. The labeling itself did not actually have to be performed, since the dataset already contains the labels; they were simply assigned to the instances when needed.

Next, I created a simple model with one convolutional layer, taking GloVe word embeddings as input. The model needed to be very lightweight, since it would be retrained repeatedly after every query.
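The exact architecture is in the notebook linked at the end; purely as an illustration, a lightweight model of this kind could look roughly like the Keras sketch below. The vocabulary size, sequence length, number of filters, kernel size, and the choice to freeze the GloVe embeddings are all my assumptions, not details from the experiment.

```python
from tensorflow.keras import layers, models, initializers

def build_model(vocab_size, seq_len, embedding_matrix, n_classes):
    """Lightweight text classifier: frozen GloVe embeddings + one Conv1D layer."""
    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, embedding_matrix.shape[1],
                         embeddings_initializer=initializers.Constant(embedding_matrix),
                         trainable=False),                     # keep GloVe vectors fixed
        layers.Conv1D(128, kernel_size=5, activation="relu"),  # the single convolutional layer
        layers.GlobalMaxPooling1D(),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```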

Active learning always needs to start with a small portion of the data already labeled, so that the model is at least somewhat trained before it starts querying. This small dataset is called the seed. I used a seed of 100 instances, randomly selected from the pool, labeled, and shared by every strategy.

The next parameter is the batch size, which in the context of active learning means how many instances are queried and labeled at once before the model retrains. Ideally, we would use a batch size of 1, so that the model retrains after every labeled instance and the effectiveness of querying is maximized. This is usually not feasible, since the training itself is computationally demanding and takes time. In this experiment, I used a batch size of 50.

Altogether, I ran 98 iterations with this batch size, which, together with the seed of 100 instances, works through the whole pool we set up. After every iteration, the accuracy of the model was measured on a separate test set. The results for every strategy are shown in the following graph:
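Putting the numbers together with the earlier sketches: a seed of 100, a batch size of 50, and 98 iterations cover 100 + 98 × 50 = 5000 instances, i.e. the whole pool. A run with, say, least confidence sampling could then be configured roughly as below; all variable names here (such as `glove_matrix` or `pool_y`) are hypothetical, and the per-iteration accuracy tracking on the held-out test set is not shown — it would simply be an `evaluate` call on the test data after each retraining step inside the loop.

```python
# Hypothetical configuration matching the experiment:
# 100-instance seed, batch size 50, 98 iterations -> 5000 instances in total.
model = active_learning_loop(
    model=build_model(vocab_size, seq_len, glove_matrix, n_classes),
    seed_X=pool_X[:100], seed_y=pool_y[:100],   # randomly ordered pool, first 100 as seed
    pool_X=pool_X[100:], pool_y=pool_y[100:],   # remaining 4900 unlabeled instances
    query_strategy=least_confidence,            # or margin / entropy / a random baseline
    batch_size=50,
    n_iterations=98,
)
```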

Accuracy given the amount of labeled data, using different query strategies. Image by author.

As you can see, all three query strategies performed very similarly, and all did better than random sampling. We can also see that after about 4500 labeled instances, all query strategies (including random sampling) settle around the same accuracy. This is simply because we are reaching the end of the pool, where almost all the data are labeled regardless of the strategy.

Let’s now try to translate the results into a real-life scenario. If the goal of our project were to reach an accuracy of, say, 85%, it could be achieved by labeling either 1900 instances using random sampling or 1150 instances using least confidence sampling (1 − 1150/1900 ≈ 0.395), i.e. 39.5% less time spent on labeling the data.

Another way to define the scope of the project is by labeling capacity, e.g. 2000 data instances. In that case, we would reach an accuracy of 85.96% using random sampling and 88.4% using least confidence sampling, a 2.44 percentage point increase in accuracy for labeling the same amount of data.


Conclusion

The results seem quite promising; however, we must be aware that they might differ from dataset to dataset. It is also worth mentioning that active learning can come with some disadvantages.

One of them is that setting up the infrastructure for active learning at the beginning of a project takes some time (although some frameworks are already available), and the repeated retraining of the model can be computationally demanding.

The other is that the distribution of the labeled data you end up with is biased towards the model that was used for querying. If you later decide to change the model for some reason, the data no longer come from a random distribution, which might be worse for your new model than a randomly sampled dataset would have been.

Overall, I hope that I gave you a nice overview of what to expect from active learning, and that you will take it into consideration when starting your next project.

The code for the experiment is available in this Kaggle notebook.

Inspired by Active Learning Literature Survey.

Thank you for reading!
