Incremental (Online) Learning with Scikit-Multiflow
A practical introduction to incremental learning in Python using scikit-multiflow
Introduction
Data is all around us. Whether it’s profile pictures, tweets, sensor readings, credit card transactions, emails, or news feeds, data is here… and it’s being generated at incredible speed. With these seemingly infinite streams of data, one of the key challenges is to create lightweight models that are always ready to predict and can adapt to changes in the data distribution. The limitations of traditional machine learning methods in this setting have led to the development of online learning (also called incremental learning) methods.
In this post, we will gently introduce incremental learning through a practical implementation of a simple online classifier with scikit-multiflow, a Python framework for data stream learning.
What is Incremental Learning?
At every iteration, the model predicts a class label, reveals the true label, and is then updated
Incremental learning refers to a family of scalable algorithms that learn to sequentially update models from infinite data streams¹. Whereas in “traditional” machine learning, we’re given a complete dataset consisting of (input, output) pairs, in incremental learning, we don’t have all of the data available when creating the model. Instead, the data points arrive one at a time and we have to build a “living” model, one that learns and adapts as the data comes. An incremental model has the following characteristics²:
- It can predict at any time
- It can adapt to concept drift — i.e. changes in the data distribution⁴. To give a concrete example, if we’re interested in building a model that predicts how much money a bank should loan, a financial crisis might alter the amounts or the factors that need to be considered. In this case, the model needs to re-learn a lot of information.
- It is able to process an infinite data stream with finite resources (time and memory). This means that it cannot store all of the training data as in typical machine learning approaches.
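To make these characteristics concrete, here is a toy sketch (plain Python, no library assumed) of an incremental learner: a majority-class predictor that can answer at any time and uses memory proportional to the number of classes, never the number of samples. The class name and interface are illustrative, not part of any library.

```python
from collections import Counter

class MajorityClassLearner:
    """Toy incremental model: predicts the most frequent class seen so far.
    Memory is O(number of classes), not O(number of samples)."""

    def __init__(self):
        self.counts = Counter()

    def predict(self):
        # Ready to predict at any time, even before seeing any data
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def partial_fit(self, y):
        # Update from a single label; past samples are never stored
        self.counts[y] += 1

model = MajorityClassLearner()
for label in [0, 1, 1, 0, 1, 1]:
    prediction = model.predict()  # predict first (test-then-train)
    model.partial_fit(label)      # then update with the revealed label
```

This predict-then-update cycle is exactly the loop pictured above; real incremental learners differ only in how much cleverer the update step is.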
Working with Data-Streams in Python
Now that we’ve talked about what incremental learning is, let’s work out a simple example in Scikit-Multiflow, a free Python framework for data-stream learning.
The first thing that we want to do is to install scikit-multiflow.
pip install -U scikit-multiflow
Importing a data generator is easy and can be done with the following command:
from skmultiflow.data import SEAGenerator
Here, we’re going to work with the SEA generator, but many other options are available (see the documentation for details: https://scikit-multiflow.github.io/scikit-multiflow/ ). The SEA generator produces an infinite data stream with 3 numerical inputs (only 2 of which are relevant to the classification task) and a binary class label. This particular data stream contains frequent, abrupt concept drift.
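For intuition about what the generator does, the original SEA concepts label each point by comparing the sum of the first two attributes against a threshold; drift is simulated by switching thresholds. Below is a rough, hypothetical re-implementation under those assumptions (the four thresholds follow the original SEA paper; the labeling direction is assumed, and the real generator adds class noise and more options):

```python
import random

SEA_THRESHOLDS = [8.0, 9.0, 7.0, 9.5]  # the four SEA "concepts"

def sea_stream(concept=0, seed=42):
    """Endless generator of (x, y) pairs under a single SEA concept.
    x has 3 attributes in [0, 10); only the first two affect the label."""
    rng = random.Random(seed)
    threshold = SEA_THRESHOLDS[concept]
    while True:
        x = [rng.uniform(0, 10) for _ in range(3)]
        # assumed labeling convention; scikit-multiflow's may differ
        y = int(x[0] + x[1] <= threshold)
        yield x, y

stream_sketch = sea_stream(concept=0)
x, y = next(stream_sketch)
```

Switching `concept` mid-stream would produce exactly the abrupt drift the real generator simulates: the same inputs suddenly map to different labels.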
Using the generator is quite easy. The first thing we need to do is initialize it as follows:
stream = SEAGenerator() # create a stream
stream.prepare_for_use() # prepare the stream for use
Then, if we wish to obtain a data sample, all we need to do is
X,Y = stream.next_sample()
where X, the input, is a numpy array of shape (1, 3) holding the three attribute values, and Y, the output, is a numpy array of shape (1,) holding the corresponding class label.
Simple Online Classifier
Now, let’s create a simple classifier for the SEA data stream. There are many incremental models available with scikit-multiflow, one of the most popular being Hoeffding Trees.
Hoeffding Trees
Hoeffding trees³ are built using the Very Fast Decision Tree learner (VFDT), an anytime system that builds decision trees using constant memory and constant time per example. Introduced in 2000 by Pedro Domingos and Geoff Hulten, it makes use of a well-known statistical result, the Hoeffding bound, to guarantee that its output is asymptotically nearly identical to that of a conventional batch learner.
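The Hoeffding bound itself is simple to state: after n independent observations of a random variable with range R, the true mean lies within ε = sqrt(R² ln(1/δ) / (2n)) of the observed mean with probability 1 − δ. A quick sketch of how the bound tightens as a tree node accumulates examples (the function name is ours, not scikit-multiflow’s):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean lies within epsilon of the
    sample mean with probability 1 - delta, after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# As n grows, an ever smaller observed gap between candidate splits
# is enough to commit to the best one with high confidence.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(value_range=1.0, delta=1e-7, n=n))
```

This is why VFDT can split on a node after seeing only a finite prefix of the stream: once the bound is smaller than the gap between the two best split candidates, more data is very unlikely to change the decision.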
In scikit-multiflow, creating a Hoeffding Tree is done as follows
from skmultiflow.trees import HoeffdingTree
tree = HoeffdingTree()
Training a Hoeffding Tree for Classification
If we want to train the tree on the SEA data stream, we can just loop through however many data points we want.
nb_iters = 2000  # number of stream samples to process

correctness_dist = []
for i in range(nb_iters):
    X, Y = stream.next_sample()   # get the next sample
    prediction = tree.predict(X)  # predict Y using the tree
    if Y[0] == prediction[0]:     # check the prediction
        correctness_dist.append(1)
    else:
        correctness_dist.append(0)
    tree.partial_fit(X, Y)        # update the tree
Using “correctness_dist”, a list of ones and zeros recording whether the learner correctly classified each incoming sample, we can plot the accuracy over time:
import matplotlib.pyplot as plt

time = [i for i in range(1, nb_iters)]
accuracy = [sum(correctness_dist[:i]) / i for i in range(1, nb_iters)]

plt.plot(time, accuracy)
plt.show()
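As an aside, recomputing the prefix sum on every iteration is O(n²) overall, which matters for long streams; a running sum produces the same curve in one pass. A minimal sketch, assuming correctness_dist is the 0/1 list built in the training loop:

```python
def running_accuracy(correctness_dist):
    """Accuracy after each sample, computed in a single O(n) pass."""
    accuracies = []
    correct_so_far = 0
    for i, outcome in enumerate(correctness_dist, start=1):
        correct_so_far += outcome
        accuracies.append(correct_so_far / i)  # accuracy over first i samples
    return accuracies

acc = running_accuracy([1, 0, 1, 1])
```

The resulting list can be passed to plt.plot in place of the slicing-based version.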
Alternative Approach with Scikit-Multiflow
In scikit-multiflow, there is a built-in way to do the exact same thing with less code. What we can do is import the EvaluatePrequential class:
from skmultiflow.evaluation import EvaluatePrequential
We can then set up an “evaluator” as follows:
evaluator = EvaluatePrequential(show_plot=True, max_samples=nb_iters)
Setting show_plot=True opens a pop-up window with a real-time plot of the classification accuracy.
Now that the evaluator is set up, we can use it to incrementally train our Hoeffding Tree on the SEA data stream, in the same way as before:
evaluator.evaluate(stream=stream, model=tree)
Conclusion
Hopefully, this tutorial has helped you understand the basics of incremental learning. Moreover, I hope that you now grasp how to use scikit-multiflow for basic data-stream learning tasks.
References
[1] Doyen Sahoo et al., “Online Deep Learning: Learning Deep Neural Networks on the Fly” (2017), arXiv:1711.03705
[2] Jesse Read et al., “Batch-Incremental Versus Instance-Incremental Learning in Dynamic and Evolving Data” (2012), doi:10.1007/978-3-642-34156-4_29
[3] Pedro Domingos and Geoff Hulten, “Mining High-Speed Data Streams” (2000), doi:10.1145/347090.347107
[4] Maayan Harel et al., “Concept Drift Detection Through Resampling” (2014)