Machine learning for streaming data with creme

Online machine learning can change the way you think about data science in production

Max Halford
Towards Data Science

Motivation

Deploying machine learning models into a production environment is a difficult task. Currently, the common practice is to have an offline phase where the model is trained on a dataset. The model is afterwards deployed online to make predictions on new data. Therefore, the model is treated as a static object. In order to learn from new data, the model has to be retrained from scratch. This deployment pattern is sometimes referred to as the lambda architecture.

Model lifecycle with the lambda architecture, source: The Benefits of Online Machine Learning

The lambda architecture is so ubiquitous that it has permeated the entire industry. In fact, AutoML platforms such as DataRobot and H2O do exactly this. You give them data, and they provide you with an API behind which a model is sitting. The same goes for tools with less automation such as MLflow, Amazon Sagemaker, Cubonacci, and Alteryx. This is also the approach taken by lesser-known projects such as Cortex and Clipper. I’m not knocking these tools; in fact I find them all wonderful, and they’re providing healthy competition to each other.

Machine learning models are deployed in such a manner because they are limited by the way they learn. Indeed, machine learning algorithms usually assume that all the training data is available at once. This is referred to as batch learning in the literature. Historically, statisticians and ML researchers have mostly been interested in “fitting” a model to a dataset and leaving it at that. Only a small subset of research has delved into designing models that are able to update themselves whenever new data is available.

Wouldn’t it be nice to have models that are able to learn incrementally? Think about it: take a student who goes to classes at college every day. When a class begins, the student has accumulated knowledge from all past classes. The teacher doesn’t start over from the first class every time he wants his students to learn a new idea. That would be crazy, right? Well, that’s exactly what a lot of machine learning practitioners are doing right now.

It turns out that batch learning has a lesser-known sister called online learning. As you might have guessed, online learning encompasses all models that are able to learn one observation at a time. Online learning is a paradigm shift because it changes the way we think about machine learning deployment. In addition to making predictions for new data, an online model is also able to learn from it. Indeed, the prediction and learning phases of an online model can be interleaved into what is sometimes called the kappa architecture.

Model lifecycle with the kappa architecture, source: The Benefits of Online Machine Learning

With the kappa architecture, the data is treated as a stream. Once a model has been updated with a new piece of data, that piece of data can effectively be discarded. In other words, you don’t have to store a historical training set and retrain your model every so often. Another benefit is that your model is always up-to-date. The natural consequence is that your model is able to deal with concept drift, which happens when the data’s distribution evolves as time goes on. Moreover, the model lifecycle is easier to think about because the learning and prediction steps both process one input at a time. Batch models, meanwhile, have to be retrained as often as possible to cope with concept drift. This requires constant monitoring and makes maintenance more difficult. With batch learning you also have to ensure that the features you’ve used for training the model are also accessible in production.

Batch machine learning is great, and it works fine in many cases. However, online machine learning is a more adequate solution for some use cases. It just makes sense for applications where new data is constantly arriving: spam filtering, recommender systems, IoT sensors, financial transactions, etc. As we just mentioned, online models especially shine when the patterns in the data are evolving and require the model to adapt. In my experience, many practitioners try to fit a square peg into a round hole: batch machine learning isn’t made to be used in a streaming environment. I’ve seen batch models plummet in production because they were not able to learn from new data.

Using an online machine learning model can also lower your operational costs, both in terms of compute power and human intervention. First of all, you don’t need powerful hardware to process a streaming dataset, because only one element of the stream lives in memory at a time. In fact, online machine learning can effectively be deployed on a Raspberry Pi. Secondly, your data scientists can spend less time dealing with model retraining, because model updates are baked into online machine learning, and can therefore spend more time on higher-value tasks.

So why aren’t we all using online machine learning? Well, some of us are. Big players — you know who I mean — are using online models for ad click-through rate prediction as well as news feed recommendations. Online machine learning just isn’t a mainstream notion. The culprits are easy to find. Kaggle competitions present data science as a scientific experiment where the data is split into a training set and a test set. University courses and MOOCs teach students the most popular machine learning algorithms, which are mostly batch algorithms. Finally, popular libraries such as scikit-learn encourage users to think in terms of rectangular datasets that hold in memory, not in terms of streaming data.

In my honest opinion, the main reason why people aren’t using online machine learning is simply that there isn’t a lot of tooling available for it. There are great tools to analyze streaming data, such as Samza and Flink, but they don’t let you do any serious machine learning. There are also tools to learn from large datasets, such as Spark, Vowpal Wabbit, Dask, and Vaex. However, these tools don’t fully embrace the online paradigm, and mostly still assume that you are working with a static dataset, albeit one that doesn’t fit in memory.

Introducing the creme library

I discovered online machine learning some 15 months ago. Since then it’s been a bit of a revelation and I’ve been working hard to share my view. In order to do so, I started writing a Python library called creme. The name comes from incremental learning, which is a synonym for online learning. Next, some friends of mine joined me: Geoffrey Bolmier (who is currently being paid to work on scikit-learn), Raphaël Sourty, Robin Vaysse, and Adil Zouitine. We’ve recently released version 0.5 and have started to see it being used in a few companies for proofs of concept.

creme is a general-purpose library, and as such covers many areas of machine learning, including feature extraction and model selection. You can think of creme as the scikit-learn of online machine learning — we realize how ambitious that may sound.

As a very simple introductory example, let’s say you want to compute the average of a sequence of values. The natural way to proceed is to accumulate all the values and divide the total by the number of values. However, there is an online algorithm which is exact and doesn’t even need to know the number of values in advance. It’s called the West algorithm and is trivial to implement. An online mean can be computed in creme like so:

Running average in creme
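
A minimal sketch of what this looks like (stats.Mean, with its update and get methods, is creme’s actual API; the input values are illustrative):

```python
from creme import stats

mean = stats.Mean()

for x in [5, 10, 6]:
    mean.update(x)  # each update nudges the mean by (x - mean) / n

print(mean.get())  # 7.0
```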

Calculating a running average might not seem like much of a feat, but it opens the door to more advanced techniques such as target encoding:

Target encoding in creme
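
A sketch of how this works, assuming TargetAgg’s by and how parameters as in creme 0.5; the exact name of the output feature is an implementation detail and may differ:

```python
from creme import feature_extraction, stats

# Maintain a running mean of the target, grouped by the 'shop' feature
agg = feature_extraction.TargetAgg(by='shop', how=stats.Mean())

X_y = [
    ({'shop': 'A'}, 10),
    ({'shop': 'B'}, 4),
    ({'shop': 'A'}, 20),
]

for x, y in X_y:
    print(agg.transform_one(x))  # the shop's running target mean so far
    agg = agg.fit_one(x, y)      # update the per-shop statistic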

You’ll notice in the previous snippet that TargetAgg has two methods: fit_one, which updates the model, and transform_one, which converts the input into features. In this case, TargetAgg has a transform_one method because it’s a transformer (duh). Other estimators, such as linear_model.LinearRegression, have methods fit_one and predict_one. Classifiers, such as tree.DecisionTreeClassifier, also have a predict_proba_one method.
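
To make the convention concrete, here is a hedged sketch with a logistic regression (the feature names are made up; before any learning, the predicted probabilities come from zero-initialized weights):

```python
from creme import linear_model

model = linear_model.LogisticRegression()
x, y = {'n_links': 12, 'page_rank': 0.3}, True  # illustrative observation

print(model.predict_proba_one(x))  # {False: 0.5, True: 0.5} with untrained weights
model = model.fit_one(x, y)        # fit_one returns the updated model
print(model.predict_one(x))        # the most likely class
```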

The other thing to know about creme is that features are stored inside dictionaries. Dictionaries are practical because they are implicitly sparse: 0 values are simply not included in the dictionary. This makes it possible to scale to large datasets with millions of features. Moreover, dictionaries do not have a fixed length, which means that creme models will keep functioning even if you add or remove features. Finally, using dictionaries means that each feature has a name, which is a human-friendly design. Note that we don’t use numpy or pandas, although creme supports both formats. Instead, we mostly rely on Python’s standard library, which provides great native support for working with dictionaries. As we will see later on, this doesn’t impact performance, quite the contrary in fact.
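
For instance, two observations don’t need to share keys, and features that are absent simply count as zeros:

```python
# Sparse bag-of-words features: only the words that occur are stored
x1 = {'word_free': 1, 'word_offer': 2}
x2 = {'word_meeting': 1}  # different keys are fine from one observation to the next
```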

Now for the serious stuff: let’s train a classifier on a stream of data. As an example, we’ll use the Website Phishing dataset, which describes web pages and indicates whether or not they are phishing attempts. We’ll use a logistic regression from the linear_model module. We’ll measure the performance of the model with an instance of metrics.Accuracy which, as you may have guessed, can be updated online.

Learning on the Website Phishing dataset
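
Here is a sketch of that loop, assuming the datasets.Phishing loader that ships with creme 0.5:

```python
from creme import compose, datasets, linear_model, metrics, preprocessing

X_y = datasets.Phishing()  # a stream of (features, label) pairs

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)

metric = metrics.Accuracy()

for x, y in X_y:
    y_pred = model.predict_one(x)      # predict before seeing the label...
    metric = metric.update(y, y_pred)  # ...score the prediction...
    model = model.fit_one(x, y)        # ...then learn from the observation

print(metric)
```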

As you can see in the above snippet, the prediction and learning steps are interleaved. In this case, we’re streaming over a dataset in sequential order, but you have full control over how to set this up. By using online machine learning, and in particular creme, you’re no longer hindered by the need to have all the data at once in order to train a model. The other thing to notice is that we’ve used a pipeline to compose a standard scaling step with the logistic regression. Pipelines are first-class citizens in creme, and we strongly encourage their use. You can check out our user guide for more examples. In creme, pipelines have convenience methods such as model.draw():

Graph representation of the Website Phishing pipeline
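
As an aside, the same pipeline can be composed with the | operator, which creme provides as shorthand (drawing requires graphviz to be installed):

```python
from creme import linear_model, preprocessing

model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
model.draw()  # renders the pipeline as a graph
```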

Note that not all algorithms have an online equivalent. For instance, kernel SVMs are impossible to fit on a streaming dataset. Likewise, CART and ID3 decision trees can’t be trained online. However, lesser-known online approximations exist, such as random Fourier features for SVMs and Hoeffding trees for decision trees. creme has many things to offer, and we strongly encourage you to check out its GitHub page and the online documentation. Here are a few highlights:

  • The reco module contains algorithms for recommender systems, much like surprise, but online.
  • Time series forecasting, with the time_series module.
  • You can do online model selection with SuccessiveHalvingRegressor and SuccessiveHalvingClassifier from the model_selection module. We also have plans for integrating multi-armed bandit algorithms.
  • Online imbalanced learning.
  • Progressive validation, which is the online counterpart of cross-validation.
  • Decision trees and random forests.
  • Linear models with a wide array of optimizers.

The main criticism of online learning is that it is slower than batch learning. Indeed, because the data is processed one element at a time, we can’t make the most of vectorization. One solution is mini-batching, which is effectively a compromise between online learning and batch learning. However, with creme we have decided to only work with one element at a time, because it is easier to reason about and drastically simplifies model deployment. Because of this design choice, creme is in fact much faster than scikit-learn, Keras, and Tensorflow in an online learning scenario. The following table summarizes the results for a linear regression implemented in different libraries:

Online linear regression benchmarks

You can find the code used to run the benchmarks here. This doesn’t mean that creme can process a dataset faster than other libraries; it means that creme is faster when the data is processed one element at a time. Other libraries cater to batch learning scenarios, not online learning. Depending on your data science experience, this might seem like a niche case. However, processing data in such a manner makes model deployment and maintenance a breeze in comparison with batch learning.

Deployment

Once you’ve designed an online machine learning model, deploying it is relatively straightforward. Assuming that you want to interact with your model via web requests, the API you have to implement is very simple. Essentially, you need a route to update the model, and another route to make predictions. You might also want to implement some routes to monitor the performance of the model. That’s really it: you don’t need to think about model retraining or worry about your model being out of date. That’s the beauty of the kappa architecture. Treating your data as a stream brings many benefits to the table. On top of that, creme is a great fit because it consumes dictionaries, which translate easily to and from JSON, the lingua franca of APIs.
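
To make this concrete, here is a minimal sketch of such an API with Flask; the route names, payload format, and module-level model are assumptions made for the example, not chantilly’s actual interface:

```python
import flask
from creme import linear_model, preprocessing

app = flask.Flask(__name__)

# A single in-memory model; a real deployment would persist it across restarts
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()

@app.route('/predict', methods=['POST'])
def predict():
    x = flask.request.json  # the JSON object maps directly to a feature dict
    return {'prediction': model.predict_proba_one(x).get(True, 0.0)}

@app.route('/learn', methods=['POST'])
def learn():
    payload = flask.request.json
    model.fit_one(payload['features'], payload['ground_truth'])
    return {'status': 'learned'}
```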

To demonstrate how simple it is to deploy a model built with creme, we’ve implemented a small tool called chantilly. This tool is nothing more than a Flask app which provides the necessary routes for uploading a model, making it learn, obtaining predictions, and monitoring metrics. It also provides a live dashboard, which is currently very plain. chantilly is still in its infancy, and we have plans to make it even more enjoyable and simple to use.

The chantilly repository contains an example based on the New York City taxis dataset. The dataset contains taxi trip details along with their durations. The goal is to predict how long a trip will take. In a nutshell, we are able to take this historical dataset and simulate a stream of events exactly as it happened. We are therefore able to establish a trustworthy benchmark and a reliable idea of how our model would have performed. This is very important: being able to correctly reproduce a production scenario helps build trust in your model. With batch models you’re never quite sure how your model is going to perform. Sure, you can do cross-validation, but that doesn’t exactly mimic a production scenario.
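
In a hedged sketch (the file name, column names, and features are made up for illustration), replaying history just means sorting the records by event time and predicting before revealing each outcome:

```python
import csv

from creme import linear_model, preprocessing

model = preprocessing.StandardScaler() | linear_model.LinearRegression()

def replay(path):
    """Yield historical trips in the order in which they occurred."""
    with open(path) as f:
        yield from sorted(csv.DictReader(f), key=lambda trip: trip['pickup_datetime'])

for trip in replay('taxi_trips.csv'):
    duration = float(trip.pop('duration'))     # the target isn't known at pickup time
    x = {'distance': float(trip['distance'])}  # feature extraction elided for brevity
    y_pred = model.predict_one(x)              # predict as if the trip just started
    model = model.fit_one(x, duration)         # learn once the true duration arrives
```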

Note that we don’t intend for chantilly to be a one-size-fits-all tool. We realize that every data science team has its own particular way of doing model deployment. Therefore, chantilly is mostly intended to be a simple and readable example of how to implement the kappa architecture for machine learning models. Nonetheless, we will keep working on chantilly and provide ways to cooperate with existing infrastructure. For instance, chantilly currently provides a streaming API to monitor metric updates. This allows you to write a short script that sends these updates to, say, an InfluxDB database connected to your Grafana dashboard.

Going further

creme is a young library, and yet we’re already getting some great feedback from data science teams who are using it for proofs of concept. Here are the key takeaways:

  • Online models can learn one instance at a time.
  • Online models do not have to be retrained, as they learn on the fly.
  • Online machine learning simplifies model deployment and maintenance.

If you want to know more about creme, we encourage you to check out the GitHub repository as well as the user guides and the API reference. Starring us on GitHub also helps us gain exposure to a wider audience.

We’re very interested in onboarding new contributors. If creme has piqued your interest and you want to be part of the team, then feel welcome to get in touch!

We’re also happy to (freely) work hand-in-hand with data science teams to see if creme suits your needs, and in what ways we can improve it.
