
Optimize Annotation and Training: An Online-Active Learning Framework

Learn how to implement methods to streamline all the steps of your machine learning project and get ready for production.

Image by author

Streamline your machine learning projects for production.

At Picsell.ia, we are devoted to providing the most efficient way to build a computer vision system at low cost.

As everyone knows, the labeling and training costs for deep learning are high, but over the past few years we have seen the emergence of interesting techniques to make them more affordable.

Today, I’ll talk about two of them: Active Learning and Online Learning.

This will be a bit long, so grab a cup of coffee and get ready to dive behind the scenes of the Picsell.ia platform with me.

Please note that our Online-Active Learning framework is released on GitHub here: https://github.com/PicselliaTeam/Online-Learning

Basic concept of Active Learning

Settles, 2009

A classic supervised model usually gets more accurate with more labelled data. However, annotating all that data can take a lot of time and/or money. Depending on your goal, budget, and time constraints, you might consider reducing the amount of data to label.

The whole point of Active Learning is to smartly choose which data needs to be labelled and in what order. Lots of research papers have already proven that Active Learning methods can reduce the amount of training data required to reach a specific accuracy.

I’ll be concise on the fundamentals of Active Learning, since there are more than enough articles describing the process, probably better than I could:

Active Learning: Your Model’s New Personal Trainer

An Introduction to Active Learning

The Active Learning loop is composed of three main steps.

  • The first one is to annotate a subset of our dataset.
  • Then we train a model on those labelled data.
  • Lastly, we use the model trained on the subset to make predictions on the unlabeled pool of data.

With those predictions, we use a sampling method to make a query to the annotator (called the "Oracle" in the literature).
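To make the loop concrete, here is a toy sketch in Python. The 1-D data, the threshold "model", and the sigmoid confidence are all illustrative assumptions, not the actual code of any framework:

```python
import numpy as np

# Toy 1-D "dataset": points below 0.5 belong to class 0, above to class 1.
X = np.linspace(0, 1, 100)
y = (X > 0.5).astype(int)

# Step 1: annotate a small subset (every 10th point here).
labeled_idx = list(range(0, 100, 10))
unlabeled_idx = [i for i in range(100) if i not in labeled_idx]

# Step 2: "train" a toy threshold model on the labelled subset
# (midpoint between the two class means).
xs, ys = X[labeled_idx], y[labeled_idx]
threshold = (xs[ys == 0].mean() + xs[ys == 1].mean()) / 2

# Step 3: predict on the unlabeled pool; confidence grows with the
# distance to the decision boundary (sigmoid).
probs = 1 / (1 + np.exp(-10 * (X[unlabeled_idx] - threshold)))

# Query: least-confidence sampling -> the points closest to p = 0.5
# are sent to the oracle for annotation.
uncertainty = 1 - np.maximum(probs, 1 - probs)
query = [unlabeled_idx[i] for i in np.argsort(-uncertainty)[:5]]
print(sorted(query))  # → [43, 44, 45, 46, 47], right around the boundary
```

Once the oracle labels the queried points, they move into the labelled subset and the loop starts again.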

The most common set of sampling methods is called "Uncertainty Sampling". These methods focus on the decision boundaries of your model and select the data where it is the most uncertain.

Interactive Uncertainty Sampling HeatMap by Robert Munro

But Uncertainty Sampling alone is not always the most relevant way of selecting data, since you are basically ignoring the rest of the unlabeled space, which is far away from the decision boundaries.

This is often called "Exploitation" in the literature; however, you also need to throw in some "Exploration" of your data.

Since your model updates itself at each training iteration, the decision boundaries also move, which means you are already doing a bit of Exploration. But adding proper Exploration methods is still important.

The most common method to explore the dataset is simply to mix "Uncertainty Sampling" and "Random Sampling": we choose to Exploit the data (Uncertainty Sampling) with probability p and to Explore (Random Sampling) with probability 1-p.
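Here is a minimal sketch of that p / 1-p mix, assuming a hypothetical dictionary mapping item indices to uncertainty scores:

```python
import random

def select_batch(scores, batch_size, p_exploit=0.7, rng=random.Random(0)):
    """Mix Uncertainty Sampling (Exploitation) and Random Sampling
    (Exploration): with probability p_exploit we take the most uncertain
    remaining item, otherwise (probability 1 - p_exploit) a random one."""
    remaining = dict(scores)
    picked = []
    while remaining and len(picked) < batch_size:
        if rng.random() < p_exploit:
            idx = max(remaining, key=remaining.get)  # Exploitation
        else:
            idx = rng.choice(list(remaining))        # Exploration
        picked.append(idx)
        del remaining[idx]
    return picked

# Hypothetical uncertainty scores for 10 unlabeled items.
batch = select_batch({i: i / 10 for i in range(10)}, batch_size=4)
print(batch)  # a mix of the highest-scoring and random items
```

The fixed seed only makes the sketch reproducible; in practice you would draw p_exploit from your own Exploration/Exploitation trade-off.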

More complex and effective Exploration methods exist, such as Cluster-Based Sampling, but we are not going to detail them today.

The balance between Exploration and Exploitation is a common dilemma in the field.

The basic concept is quite simple, but the implementation can be tricky:

  • Which sampling method should I pick?
  • How do I find the balance between Exploration and Exploitation of the dataset?
  • How many items should be sent back to the oracle?
  • What should the oracle do while waiting for the query?

None of those questions has a definitive answer. For example, you may want your model to adapt in real time to the annotations… or maybe you label a batch of data once a year, so you don’t need to update your model often.

The main practical consideration when looking at real-world use of active learning is:

"Will I save time and money with active learning?"

Indeed, we already know that we can learn with less data, but the implementation of active learning can easily lead to a waste of time:

  • retrieving the labelled data,
  • initializing the training,
  • the training itself,
  • predicting on the unlabelled data and making the queries

can, taken together, be more time-consuming than just annotating all of your data.

Depending on the speed at which you want your model to adapt to new training data, the time you lose there can be a deal-breaker.

As I said earlier, it really depends on the frequency at which you update your model.

The thing is, if you have a very large dataset that you can’t annotate entirely, or if you want your model to learn continuously from fresh real-world data, then you may want to use "Online Learning", which is a big word for real-time learning, or at least something close to it.

Online learning

The principle of Online Learning is to train your model as soon as new training data become available, keeping the model in memory and never re-initializing its variables between iterations.

This is more or less the opposite of the standard approach, where you have all your training data ready before launching the training. With Online Learning, your model can adapt to new data in real time (or almost, if you decide to do Batched Online Learning).
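As a sketch of the idea (a hand-rolled logistic model on simulated data, not any production setup): the weights live in memory, and each incoming batch triggers a single gradient step, never a re-initialization.

```python
import numpy as np

class OnlineLinearModel:
    """Minimal online learner: weights persist across updates; each
    new batch refines them in place."""
    def __init__(self, n_features, lr=0.1):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = lr

    def partial_fit(self, X, y):
        # One averaged gradient step on the new batch only (logistic loss).
        p = 1 / (1 + np.exp(-(X @ self.w + self.b)))
        grad = p - y
        self.w -= self.lr * X.T @ grad / len(y)
        self.b -= self.lr * grad.mean()

    def predict(self, X):
        return ((X @ self.w + self.b) > 0).astype(int)

rng = np.random.default_rng(1)
model = OnlineLinearModel(n_features=2)

# Simulate a stream: each batch "arrives" later; the model is updated
# as soon as it does, without retraining from scratch.
for _ in range(200):
    X = rng.normal(size=(16, 2))
    y = (X.sum(axis=1) > 0).astype(int)
    model.partial_fit(X, y)

X_test = rng.normal(size=(100, 2))
acc = (model.predict(X_test) == (X_test.sum(axis=1) > 0)).mean()
print(f"accuracy: {acc:.2f}")
```

In a real computer-vision setting the same pattern applies with a deep network: keep the weights loaded and call the training step on each incoming batch.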

The main challenge of real time learning is the implementation more than the concept.

The drawbacks

Online learning is prone to forgetting previous data if you don’t train on the whole dataset at each iteration.

Indeed, if you only train on new examples, it’s like doing transfer learning every time from the weights obtained on previous data, which gives less and less weight to old data over time (a phenomenon known as catastrophic forgetting).

The danger in the other direction is that if you train on all of your data every time, at some point the size of your dataset becomes unmanageable; moreover, your model will consider old data as relevant as new data, which can lead to big issues in the real world.

It can also be hard to scale horizontally on multiple GPU instances.

But once you have it all set up (and we are here to help), you will see the outstanding benefits really fast !

Combining Active learning with Online learning

It’s time to sum up what we have seen in this article:

  • Active Learning allows us to make queries on our unlabelled dataset in order to reduce the amount of data needed to reach specific performance thresholds.
  • We also saw that we can implement Online Learning to adapt our model to a stream of incoming data in real-time.

What about combining both ?

All the questions around active learning we raised before still apply. But this would lead to more questions like:

  • "How much data should I buffer before sending it to the model?"
  • "Should I train on the whole dataset every time, or just on the newly received data?"
  • "Should I train continuously and dynamically add in new data, or wait for new data to come after each iteration?"
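One possible answer to the first question can be sketched as a simple buffer; the `TrainingBuffer` class and its callback below are hypothetical, not part of our framework:

```python
from collections import deque

class TrainingBuffer:
    """Accumulate freshly annotated items and flush a batch to the
    trainer once `buffer_size` items have been collected."""
    def __init__(self, buffer_size, on_flush):
        self.buffer_size = buffer_size
        self.on_flush = on_flush  # e.g. a function that launches a training step
        self.items = deque()

    def add(self, item):
        self.items.append(item)
        if len(self.items) >= self.buffer_size:
            batch = [self.items.popleft() for _ in range(self.buffer_size)]
            self.on_flush(batch)

# Collect batches instead of training, just to show the behavior.
flushed = []
buf = TrainingBuffer(buffer_size=4, on_flush=flushed.append)
for i in range(10):
    buf.add(i)
print(flushed)  # → [[0, 1, 2, 3], [4, 5, 6, 7]]; items 8 and 9 still buffered
```

Tuning `buffer_size` is exactly the trade-off discussed above: a small buffer adapts faster, a large one trains less often but on more data.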

That’s why we wanted to propose a small open-source framework to experiment with Online-Active Learning (for Computer Vision) that will save you a lot of headaches while trying to answer those questions!

Our framework

It’s a small web app with two servers: one running a minimal labeling app for classification with active learning in the back, the other training the model.

Image by author

Only image classification is currently supported, but we will add an object detection backend soon! (When the TensorFlow Object Detection API is updated to TF2 – milestone: 8 July 2020.)

The idea is to train and make active learning queries while continuously annotating data. By streamlining the process with our framework, you never lose time, even if the active learning strategy isn’t successful. And if you manage to experiment with our tools and find the right strategy, you can save up to 5x the time on your whole project!

You can play with different parameters to tune your Online-Active Learning strategy, such as:

  • the buffer size before sending new data
  • the model
  • the sampling method…

You can also switch between standard Uncertainty Sampling methods. We will add more to the repo, but you can easily define your own as well.

Here is what you can currently find :

  • Entropy sampling
  • Least confidence sampling
  • Margin of confidence sampling
  • Ratio of confidence sampling
  • Uncertainty + random sampling
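These methods can all be expressed as scoring functions over the predicted class probabilities; here is a generic NumPy sketch (illustrative, not the repo’s exact code):

```python
import numpy as np

def entropy_sampling(probs):
    """Entropy of the predicted distribution: higher = more uncertain."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def least_confidence(probs):
    """1 minus the probability of the top class."""
    return 1 - probs.max(axis=1)

def margin_of_confidence(probs):
    """Small gap between the two most likely classes = uncertain."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return 1 - (top2[:, 1] - top2[:, 0])

def ratio_of_confidence(probs):
    """Ratio of second-best to best probability (close to 1 = uncertain)."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 0] / top2[:, 1]

# Two predictions over 3 classes: one confident, one uncertain.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25]])
for fn in (entropy_sampling, least_confidence,
           margin_of_confidence, ratio_of_confidence):
    print(fn.__name__, fn(probs))  # the uncertain second row scores higher
```

Whatever the scoring function, the query simply takes the highest-scoring unlabeled items.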

We also added a simple function that mixes an Uncertainty Sampling method with Random Sampling, with probability p for Exploitation.

With this framework, our hope is that anyone can learn how to train models efficiently and maybe answer the questions we raised earlier, but also that it gives you some intuition on how to better choose parameters for different use cases.

If you want to contribute or give us feedback on the framework, you can add your comment here or write a Pull Request on GitHub. We would be glad to take this further with your help!

In the meantime, follow us for more articles on Machine Learning and Computer Vision, join Picsell.ia, and access our open CV platform with a bunch of free datasets and pre-trained models!

Resources

[1] Robert Munro, Human-in-the-Loop Machine Learning (2021)

[2] A. Bondu, V. Lemaire, M. Boullé, Exploration vs. Exploitation in Active Learning: A Bayesian Approach (2010)

[3] Burr Settles, Active Learning Literature Survey (2009)

