How Active Learning can help you train your models with less Data

Charles Brecque
Towards Data Science
4 min read · Oct 9, 2018


Even with massive computational resources, training a Machine Learning model on a large data set can take hours, days, and sometimes weeks, which is expensive and a burden on your productivity. However, in most cases you do not need all of the available data to train your models. In this article, we compare data subsetting strategies and the impact they have on model performance (training time and accuracy), by training an SVM classifier on subsets of the MNIST data set.

Building subsets with Active Learning

We will use Active Learning to build subsets of our data.

Active Learning is a special case of Machine Learning in which a learning algorithm is able to interactively query the user to obtain the desired outputs at new data points¹.

The subsetting is done by an Active Learner which, following a given query strategy, learns which training subsets are most useful for maximising the accuracy of our model. We are going to consider four different strategies for building these subsets from the original training set:

  • Random sampling: the data points are sampled at random.
  • Uncertainty sampling: we select the points whose predicted class we are most uncertain about.
  • Entropy sampling: we choose the points whose class probabilities have the largest entropy.
  • Margin sampling: we choose the points for which the difference between the most and second most likely classes is the smallest.

The probabilities used by these strategies come from the predictions of the SVM classifier, as sketched below.
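As a rough illustration, here is a minimal NumPy sketch of how each strategy might score candidate points from an array of predicted class probabilities (the function name and the details of the scoring are my own assumptions, not a particular library's API):

```python
import numpy as np

def query_indices(probs, n_queries, strategy="uncertainty"):
    """Return the indices of the n_queries most informative points, given a
    (n_samples, n_classes) array of predicted class probabilities."""
    if strategy == "random":
        return np.random.choice(len(probs), size=n_queries, replace=False)
    if strategy == "uncertainty":
        # least confident prediction: smallest maximum class probability
        scores = 1.0 - probs.max(axis=1)
    elif strategy == "entropy":
        # largest entropy of the predicted class distribution
        scores = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    elif strategy == "margin":
        # smallest gap between the two most likely classes (negated so that
        # "larger score = more informative" holds for every strategy)
        top_two = np.partition(probs, -2, axis=1)[:, -2:]
        scores = -(top_two[:, 1] - top_two[:, 0])
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return np.argsort(scores)[-n_queries:]
```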

For this study we are going to build subsets of 5,000 (8% of the data), 10,000 (17%) and 15,000 (25%) points from the original training set of 60,000 points.
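Below is a hedged sketch of how such an active learner could grow a subset to the target size, reusing the query_indices helper above. The seed size, batch size and SVC settings are assumptions, and in practice you would score only a random sample of the remaining pool to keep the repeated predict_proba calls tractable:

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC

# Load MNIST and keep the standard 60,000-point training split.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, y_train = X[:60000] / 255.0, y[:60000]

def build_subset(X_pool, y_pool, target_size, strategy, seed_size=1_000, batch=500):
    """Grow a labelled subset by repeatedly retraining the learner and
    querying the most informative remaining points."""
    rng = np.random.default_rng(0)
    selected = set(rng.choice(len(X_pool), size=seed_size, replace=False).tolist())
    while len(selected) < target_size:
        idx = np.array(sorted(selected))
        learner = SVC(probability=True).fit(X_pool[idx], y_pool[idx])
        remaining = np.setdiff1d(np.arange(len(X_pool)), idx)
        probs = learner.predict_proba(X_pool[remaining])
        n_new = min(batch, target_size - len(selected))
        selected.update(remaining[query_indices(probs, n_new, strategy)].tolist())
    idx = np.array(sorted(selected))
    return X_pool[idx], y_pool[idx]

# e.g. X_sub, y_sub = build_subset(X_train, y_train, 15_000, "uncertainty")
```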

Results

To measure the performance of training on the subsets, we use two ratios: the accuracy ratio (accuracy of the model trained on the subset divided by the accuracy of the model trained on the full data set) and the training time ratio (time to train on the subset divided by time to train on the full data set).
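A small sketch of how these two ratios could be computed, assuming an SVC and a common evaluation set (the function name is illustrative):

```python
import time
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

def performance_ratios(X_sub, y_sub, X_full, y_full, X_eval, y_eval):
    """Accuracy and training-time ratios of a subset-trained SVM versus a
    full-data SVM, both evaluated on the same data (X_eval, y_eval)."""
    t0 = time.time(); sub_model = SVC().fit(X_sub, y_sub); t_sub = time.time() - t0
    t0 = time.time(); full_model = SVC().fit(X_full, y_full); t_full = time.time() - t0
    accuracy_ratio = (accuracy_score(y_eval, sub_model.predict(X_eval))
                      / accuracy_score(y_eval, full_model.predict(X_eval)))
    time_ratio = t_sub / t_full
    return accuracy_ratio, time_ratio
```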

We can calculate the same ratios for the test data set. The results are summarised in the following graphs; the three data points for each strategy correspond to the subset sizes (5,000, 10,000 and 15,000).

As we can see, with the uncertainty sampling strategy we achieve over 99% of the full-data accuracy with a subset of 15,000 points, in only 35% of the time it took to train the SVM on the full data set. In other words, we obtain comparable results with only 25% of the data and 35% of the training time. Random sampling is the fastest of the strategies, but also the worst in terms of accuracy ratio.

Working on subsets of data is therefore a reasonable approach for significantly reducing training time and computation without compromising accuracy. Subsetting works well on most classification data sets, but extending it to time series data, and to other types of models, requires additional work.

How much of the data do we need?

Now that we have demonstrated the value and feasibility of training models on subsets of data, how do we know what the optimal subset size should be? One approach, called FABOLAS² [Klein et al.] and implemented here, can recommend the size of the subset you should use. It does this by learning the relationship between a contextual variable (the size of the data set used) and the reliability of the final score achieved. This means that by training the model on a subset, it can extrapolate the performance of the model on the full data set.
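FABOLAS itself is considerably more involved; as a much simpler illustration of the same idea, one could fit a crude learning curve to scores measured on small subsets and extrapolate it to the full data set. The accuracies below are placeholders, not measured results:

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder accuracies for the three subset sizes; replace with your own measurements.
subset_sizes = np.array([5_000.0, 10_000.0, 15_000.0])
subset_scores = np.array([0.93, 0.95, 0.96])

def learning_curve(n, a, b):
    # Crude power-law with a fixed exponent: accuracy approaches the asymptote a as n grows.
    return a - b / np.sqrt(n)

params, _ = curve_fit(learning_curve, subset_sizes, subset_scores)
print("extrapolated accuracy on 60,000 points:", learning_curve(60_000, *params))
```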

Extensions with Bayesian Optimization

If we would like to go even further, we can tune the hyper-parameters of the model trained on the subsets more efficiently by using Bayesian Optimization. I have written extensively about it in previous posts.
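As a hedged sketch of what this could look like with scikit-optimize, here we tune the SVM's C and gamma on the subset built earlier (X_sub, y_sub) using Gaussian-process-based Bayesian optimisation; the search ranges and number of calls are assumptions:

```python
from skopt import gp_minimize
from skopt.space import Real
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Search space for the SVM's C and gamma (log-uniform priors).
space = [Real(1e-2, 1e3, prior="log-uniform", name="C"),
         Real(1e-5, 1e-1, prior="log-uniform", name="gamma")]

def objective(params):
    C, gamma = params
    # 3-fold cross-validated accuracy on the subset; negated because gp_minimize minimises.
    return -cross_val_score(SVC(C=C, gamma=gamma), X_sub, y_sub, cv=3).mean()

result = gp_minimize(objective, space, n_calls=25, random_state=0)
print("best (C, gamma):", result.x, " best CV accuracy:", -result.fun)
```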

At Mind Foundry we are striving for optimal and efficient Machine Learning through Bayesian Optimization and Active Learning. If you have any questions or would like to try our products, feel free to email me!

[UPDATE: I have started a tech company. You can find out more here]

1: https://en.wikipedia.org/wiki/Active_learning_(machine_learning)

2: Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, Frank Hutter, Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets, arXiv:1605.07079 [cs.LG]
