Incremental (Online) Learning with Scikit-Multiflow
A practical introduction to incremental learning in Python using scikit-multiflow
Introduction
Data is all around us. Whether it’s profile pictures, tweets, sensor readings, credit card transactions, emails, or news feeds, data is here… and it’s being generated at incredible speed. With these seemingly infinite streams of data, one of the key challenges is to create lightweight models that are always ready to predict and can adapt to changes in the data distribution. The limitations of traditional machine learning methods in this setting have led to the development of online learning (also called incremental learning) methods.
In this post, we will gently introduce incremental learning through a practical implementation of a simple online classifier with scikit-multiflow, a Python framework for data stream learning.
What is Incremental Learning?
At every iteration, the model predicts a class label, reveals the true label, and is then updated
Incremental learning refers to a family of scalable algorithms that learn to sequentially update models from infinite data streams¹. Whereas in “traditional” machine learning, we’re given a complete dataset consisting of (input, output) pairs, in incremental learning, we don’t have all of the data available when creating the model. Instead, the data points arrive one at a time and we have to build a “living” model, one that learns and adapts as the data comes. An incremental model has the following characteristics²:
- It can predict at any time
- It can adapt to concept drift — i.e. changes in the data distribution⁴. To give a concrete example, if we’re interested in building a model that predicts how much money a bank should loan, a financial crisis might alter the amounts or the factors that need to be considered. In this case, the model needs to re-learn a lot of information.
- It is able to process an infinite data stream with finite resources (time and memory). This means that it cannot store all of the training data as in typical machine learning approaches.
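To make these characteristics concrete, here is a toy sketch (plain Python, no library assumed) of an incremental learner: a majority-class predictor that can answer at any time and uses memory proportional to the number of classes, never the number of samples. The class name and interface are illustrative, not part of any library.

```python
from collections import Counter

class MajorityClassLearner:
    """Toy incremental model: predicts the most frequent class seen so far.
    Memory is O(number of classes), not O(number of samples)."""

    def __init__(self):
        self.counts = Counter()

    def predict(self):
        # Ready to predict at any time, even before seeing any data
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def partial_fit(self, y):
        # Update from a single label; past samples are never stored
        self.counts[y] += 1

model = MajorityClassLearner()
for label in [0, 1, 1, 0, 1, 1]:
    prediction = model.predict()  # predict first (test-then-train)
    model.partial_fit(label)      # then update with the revealed label
```

This predict-then-update cycle is exactly the loop pictured above; real incremental learners differ only in how much cleverer the update step is.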
Working with Data-Streams in Python
Now that we’ve talked about what incremental learning is, let’s work out a simple example in Scikit-Multiflow, a free Python framework for data-stream learning.
The first thing that we want to do is to install scikit-multiflow.
pip install -U scikit-multiflow
Importing a data generator is easy and can be done with the following command:
from skmultiflow.data import SEAGenerator
Here, we’re going to work with the SEA generator, but many other options are available (see the documentation for details: https://scikit-multiflow.github.io/scikit-multiflow/ ). The SEA generator produces an infinite data stream with 3 numerical inputs (only 2 of which are relevant to the classification task) and a binary class label. This particular data stream contains frequent, abrupt concept drift.
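For intuition about what the generator does, the original SEA concepts label each point by comparing the sum of the first two attributes against a threshold; drift is simulated by switching thresholds. Below is a rough, hypothetical re-implementation under those assumptions (the four thresholds follow the original SEA paper; the labeling direction is assumed, and the real generator adds class noise and more options):

```python
import random

SEA_THRESHOLDS = [8.0, 9.0, 7.0, 9.5]  # the four SEA "concepts"

def sea_stream(concept=0, seed=42):
    """Endless generator of (x, y) pairs under a single SEA concept.
    x has 3 attributes in [0, 10); only the first two affect the label."""
    rng = random.Random(seed)
    threshold = SEA_THRESHOLDS[concept]
    while True:
        x = [rng.uniform(0, 10) for _ in range(3)]
        # assumed labeling convention; scikit-multiflow's may differ
        y = int(x[0] + x[1] <= threshold)
        yield x, y

stream_sketch = sea_stream(concept=0)
x, y = next(stream_sketch)
```

Switching `concept` mid-stream would produce exactly the abrupt drift the real generator simulates: the same inputs suddenly map to different labels.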
Using the generator is quite easy. The first thing we need to do is initialize it as follows:
stream = SEAGenerator() # create a stream
stream.prepare_for_use() # prepare the stream for use
Then, if we wish to obtain a data sample, all we need to do is
X,Y = stream.next_sample()
where X, the input, is a numpy array of shape (1, 3) holding the three attribute values, and Y, the output, is a numpy array of shape (1,) holding the corresponding class label.
Simple Online Classifier
Now, let’s create a simple classifier for the SEA data stream. There are many incremental models available with scikit-multiflow, one of the most popular being Hoeffding Trees.
Hoeffding Trees
Hoeffding trees³ are built using the Very Fast Decision Tree learner (VFDT), an anytime system that builds decision trees using constant memory and constant time per example. Introduced in 2000 by Pedro Domingos and Geoff Hulten, it makes use of a well-known statistical result, the Hoeffding bound, to guarantee that its output is asymptotically nearly identical to that of a conventional batch learner.
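The Hoeffding bound itself is simple to state: after n independent observations of a random variable with range R, the true mean lies within ε = sqrt(R² ln(1/δ) / (2n)) of the observed mean with probability 1 − δ. A quick sketch of how the bound tightens as a tree node accumulates examples (the function name is ours, not scikit-multiflow’s):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean lies within epsilon of the
    sample mean with probability 1 - delta, after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# As n grows, an ever smaller observed gap between candidate splits
# is enough to commit to the best one with high confidence.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(value_range=1.0, delta=1e-7, n=n))
```

This is why VFDT can split on a node after seeing only a finite prefix of the stream: once the bound is smaller than the gap between the two best split candidates, more data is very unlikely to change the decision.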
In scikit-multiflow, creating a Hoeffding Tree is done as follows
from skmultiflow.trees import HoeffdingTree
tree = HoeffdingTree()
Training a Hoeffding Tree for Classification
If we want to train the tree on the SEA data stream, we can just loop through however many data points we want.
nb_iters = 2000  # number of stream samples to process

correctness_dist = []
for i in range(nb_iters):
    X, Y = stream.next_sample()   # get the next sample
    prediction = tree.predict(X)  # predict Y using the tree
    if Y[0] == prediction[0]:     # check the prediction
        correctness_dist.append(1)
    else:
        correctness_dist.append(0)
    tree.partial_fit(X, Y)        # update the tree
Using “correctness_dist”, a list of ones and zeros recording whether the learner correctly classified each incoming sample, we can plot the accuracy over time:
import matplotlib.pyplot as plt

time = [i for i in range(1, nb_iters)]
accuracy = [sum(correctness_dist[:i]) / i for i in range(1, nb_iters)]

plt.plot(time, accuracy)
plt.show()
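As an aside, recomputing the prefix sum on every iteration is O(n²) overall, which matters for long streams; a running sum produces the same curve in one pass. A minimal sketch, assuming correctness_dist is the 0/1 list built in the training loop:

```python
def running_accuracy(correctness_dist):
    """Accuracy after each sample, computed in a single O(n) pass."""
    accuracies = []
    correct_so_far = 0
    for i, outcome in enumerate(correctness_dist, start=1):
        correct_so_far += outcome
        accuracies.append(correct_so_far / i)  # accuracy over first i samples
    return accuracies

acc = running_accuracy([1, 0, 1, 1])
```

The resulting list can be passed to plt.plot in place of the slicing-based version.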
Alternative Approach with Scikit-Multiflow
In scikit-multiflow, there is a built-in way to do the exact same thing with less code. What we can do is import the EvaluatePrequential class:
from skmultiflow.evaluation import EvaluatePrequential
We can then set up an “evaluator” as follows:
evaluator = EvaluatePrequential(show_plot=True, max_samples=nb_iters)
Setting show_plot=True opens a pop-up window with a real-time plot of the classification accuracy.
Now that the evaluator is set up, we can use it to incrementally train our Hoeffding Tree on the SEA data stream, in the same way as before:
evaluator.evaluate(stream=stream, model=tree)
Conclusion
Hopefully, this tutorial has helped you understand the basics of incremental learning. Moreover, I hope that you now grasp how to use scikit-multiflow for basic data-stream learning tasks.
References
[1] Doyen Sahoo et al., “Online Deep Learning: Learning Deep Neural Networks on the Fly” (2017), arXiv:1711.03705
[2] Jesse Read et al., “Batch-Incremental Versus Instance-Incremental Learning in Dynamic and Evolving Data” (2012), doi:10.1007/978-3-642-34156-4_29
[3] Pedro Domingos and Geoff Hulten, “Mining High-Speed Data Streams” (2000), doi:10.1145/347090.347107
[4] Maayan Harel et al., “Concept Drift Detection Through Resampling” (2014)