
Machine Learning, Illustrated: Incremental Learning

How models learn new information over time, maintaining and building upon previous knowledge

Welcome back to the Illustrated Machine Learning series. If you read the other articles in the series, you know the drill. We take a (boring-sounding) machine learning concept and make it fun by illustrating it! This article will cover a concept called Incremental Learning, where machine learning models learn new information over time, maintaining and building upon previous knowledge. But before getting into that, let’s first talk about what the model-building process looks like today.

We usually follow a process called static learning when building models. In this process, we train a model using the latest available data. We tweak and tune the model in the training process. And once we’re happy with its performance, we deploy it. This model is in production for a while. Then we notice that the model performance is getting worse over time. That’s when we throw away the existing model and build a new one using the latest available data. And we rinse and repeat this same process.

Let’s illustrate this using a concrete example. Consider this hypothetical scenario. We started building a fraud model at the end of January 2023. This model detects whether a credit card transaction is fraudulent or not. We train our model using all the credit card transaction data that we had from the past one-year period (January 2022 to December 2022) and use transaction data from this month (January 2023) to test the model.

At the end of the next month, we notice that the model isn’t doing too well against new data. So we build another model, but this time we use data from the past one-year period (February 2022 to January 2023) to train it and the current month’s data (February 2023) to test it. All data outside of these training and testing periods is thrown out.

Next month, we again notice that the model performance isn’t holding up against new data. And again, we build a new model using data from the past one-year period.

And we keep repeating this process whenever we see a decline in model performance. This doesn’t have to be after 1 month. It could be after 3 months, 6 months, or even a year.

And why do we do this batching of data?

3 main reasons.

  1. Concept drift: As time goes on, we see a phenomenon called concept drift, which means that what we’re trying to predict changes over time and using older data might sometimes be counterproductive.
  2. Memory constraints: The larger our training set, the more memory it occupies. So we try to limit the data we input into the model.
  3. Time constraints: In the same vein as reason 2, the larger our training data, the longer it takes for our model to train. (Although this usually isn’t such a big concern for a lot of models we build. Where this might be problematic is in NLP models.)

But what if we don’t want to throw away all the old models and data? Throwing away the old models means wasting all the knowledge that the old models have gathered so far. Ideally, we want to find a way to retain the previous knowledge, while gradually adding the information coming from new data. We want to preserve this ‘institutional knowledge’ because it is essential for adapting to slowly changing or recurring patterns.

And this is exactly what incremental learning does.

In incremental learning, the model learns and enhances its knowledge progressively, without forgetting previously acquired information. It grows with the data and becomes more refined over time.

To illustrate this, let’s go back to our fraud model example. We start the same way as we did in static learning: we build a model using data from the previous year. But when we get new data next month, instead of building a new model from scratch, we just feed the new month’s data to the already existing model. And we repeat this process in month 2, month 3, and so on. So here we aren’t technically building new models; we’re building new versions of the same model.

This can be a good thing for a few reasons:

  • Efficient use of resources: each iteration trains on a smaller slice of data, so it takes up less memory and reduces costs.
  • Great for dynamic data: most real-world data keeps changing, and we can continuously update the model as new data arrives instead of building a new model from scratch each time.
  • Faster training: because the training procedure runs on a smaller portion of data, each update is much quicker.

Fraud detection actually serves as a great example of how incremental learning can be beneficial, exemplified by Mastercard’s real-time fraud detection system. With each transaction, Mastercard’s system examines more than 100 variables (such as transaction size, location, and merchant type) to gauge the likelihood of fraud. (source: DataCamp) The system uses incremental learning to adjust to changing patterns of fraudulent activity. In dynamic environments like financial fraud, where fraudsters constantly adapt their methods, the challenge of concept drift is significant. Therefore, our models must adapt swiftly to maintain their performance and effectively combat fraudsters and their evolving tactics.

And the good news is – incremental learning has already been built into some of our favorite models like XGBoost and CatBoost. And it’s very easy to implement!

I explain the mathematical details behind XGBoost and CatBoost in my previous articles.
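
To give a quick taste of how simple it can be before we get to the experiment, here is a minimal sketch of continued training in CatBoost, which accepts a previously fitted model through the init_model argument of its fit method. The data here is randomly generated purely for illustration:

import numpy as np
from catboost import CatBoostClassifier

# Hypothetical stand-ins for last year's data and a new month of data
rng = np.random.default_rng(0)
X_last_year, y_last_year = rng.normal(size=(1000, 5)), rng.integers(0, 2, 1000)
X_new_month, y_new_month = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)

# Train a base model on the older data
base_model = CatBoostClassifier(iterations=200, verbose=0)
base_model.fit(X_last_year, y_last_year)

# Continue training from the base model when the new month of data arrives
updated_model = CatBoostClassifier(iterations=50, verbose=0)
updated_model.fit(X_new_month, y_new_month, init_model=base_model)

The XGBoost equivalent, which is what we’ll use for the rest of this article, works the same way through the xgb_model parameter of fit.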

Let’s test static and incremental learning on real data to compare performance.

We’ll use this Credit Card Fraud dataset (CC0) to build the models. After some feature cleaning and selection, we end up with a dataset like this:

Where is_fraud is our target (y) column.
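
For reference, the base training and testing windows described earlier can be carved out with a simple date filter. The file name and the trans_date column below are placeholders for whatever your cleaned dataset actually contains:

import pandas as pd

# Hypothetical file and column names
df = pd.read_csv("credit_card_transactions.csv", parse_dates=["trans_date"])

# Base window: train on Jan 2022 - Dec 2022, test on Jan 2023
train_df = df[(df["trans_date"] >= "2022-01-01") & (df["trans_date"] < "2023-01-01")]
test_df = df[(df["trans_date"] >= "2023-01-01") & (df["trans_date"] < "2023-02-01")]

feature_cols = [c for c in df.columns if c not in ("is_fraud", "trans_date")]
X_train, y_train = train_df[feature_cols], train_df["is_fraud"]
X_test, y_test = test_df[feature_cols], test_df["is_fraud"]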

Let’s start with static learning. We build 12 identically configured XGBoost models…

import xgboost as xgb
model = xgb.XGBClassifier(scale_pos_weight=10).fit(X_train, y_train)

…over the following 12 one-year training periods:

At the end of every iteration, we shift the training window ahead by a month. Then we test each of these models over the following one-month test periods, which start right where their corresponding training periods end:

And then we record the 12 AUC scores:

roc_auc_score(y_test, model.predict_proba(X_test)[:,1])

We record these scores to compare them to the ones we’ll get from the incremental models. Remember the higher our AUC score, the better our model is performing.
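
Putting the static approach together, the rolling-window loop might look roughly like the sketch below. It reuses the hypothetical file and column names from above and illustrates the procedure rather than the exact notebook code:

import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score

df = pd.read_csv("credit_card_transactions.csv", parse_dates=["trans_date"])  # hypothetical names
feature_cols = [c for c in df.columns if c not in ("is_fraud", "trans_date")]

static_aucs = []
for i in range(12):
    # Shift the 1-year training window and the 1-month test window ahead by a month each iteration
    train_start = pd.Timestamp("2022-01-01") + pd.DateOffset(months=i)
    train_end = train_start + pd.DateOffset(years=1)
    test_end = train_end + pd.DateOffset(months=1)

    train_df = df[(df["trans_date"] >= train_start) & (df["trans_date"] < train_end)]
    test_df = df[(df["trans_date"] >= train_end) & (df["trans_date"] < test_end)]
    X_train, y_train = train_df[feature_cols], train_df["is_fraud"]
    X_test, y_test = test_df[feature_cols], test_df["is_fraud"]

    # A fresh model is trained from scratch on every window
    model = xgb.XGBClassifier(scale_pos_weight=10).fit(X_train, y_train)
    static_aucs.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))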

Now onto our incremental learning models. We start by training the first model over the same one-year period as we did in static learning. This is our base model. But for the next 11 models, we feed in the new months’ data to the already existing model incrementally.

So we’re continuing to train the model right where we left off each time.

The model looks pretty much the same as before except for one little change.

model = xgb.XGBClassifier(scale_pos_weight=10).fit(X_train, y_train, xgb_model=model)

Here, we pass the previous model into the current XGBoost model we’re building using the parameter xgb_model. By starting from the previously trained model, the training process becomes faster and more efficient, as the model doesn’t need to learn from scratch each time.
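
The incremental version of the loop only differs in which data it sees and in passing the previous model along. A sketch, under the same assumptions as before, could look like this:

import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score

df = pd.read_csv("credit_card_transactions.csv", parse_dates=["trans_date"])  # hypothetical names
feature_cols = [c for c in df.columns if c not in ("is_fraud", "trans_date")]

incremental_aucs = []
model = None  # no previous model on the first pass
for i in range(12):
    train_end = pd.Timestamp("2023-01-01") + pd.DateOffset(months=i)
    # The base model sees the full first year; every later iteration only sees the newest month
    train_start = pd.Timestamp("2022-01-01") if i == 0 else train_end - pd.DateOffset(months=1)
    test_end = train_end + pd.DateOffset(months=1)

    train_df = df[(df["trans_date"] >= train_start) & (df["trans_date"] < train_end)]
    test_df = df[(df["trans_date"] >= train_end) & (df["trans_date"] < test_end)]
    X_train, y_train = train_df[feature_cols], train_df["is_fraud"]
    X_test, y_test = test_df[feature_cols], test_df["is_fraud"]

    # Continue training from the previous version of the model
    model = xgb.XGBClassifier(scale_pos_weight=10).fit(X_train, y_train, xgb_model=model)
    incremental_aucs.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))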

Then we test the 12 models using the same 12 one-month test periods as we did in the static models…

…and record the AUC scores:

roc_auc_score(y_test, model.predict_proba(X_test)[:,1])

Now for the fun part, where we compare the performance of the two processes. Out of the 11 comparable AUC scores (the first score is identical because both approaches use the same training and testing data for the base model), incremental learning had the better AUC score in 7 out of 11 iterations!
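
(If you’re keeping score at home, the comparison boils down to something like the snippet below, skipping the first iteration where both approaches share identical data. static_aucs and incremental_aucs are the lists collected in the sketches above.)

wins = sum(inc > stat for stat, inc in zip(static_aucs[1:], incremental_aucs[1:]))
print(f"Incremental learning wins {wins} out of {len(static_aucs) - 1} iterations")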

Having said that, there are a few caveats:

  • Overfitting is a concern with incremental learning because it relies on a continuous stream of data. The risk is that the model might over-adjust its parameters based on recent data, which may not accurately represent the overall distribution. In contrast, static learning can consider the entire distribution at once.
  • While incremental learning can handle evolving data, abrupt changes in data trends can pose a challenge. Thus, it may not be suitable for data that changes too drastically.
  • Incremental learning faces a phenomenon called catastrophic forgetting, where old knowledge is lost as new data is learned, and it’s difficult to determine exactly what information is forgotten.

Although there are some caveats to keep in mind, incremental learning is well worth integrating into our workflow. We can push the results even further by fine-tuning parameters or refining our feature selection for each new version of the model.


And that is all on incremental learning! As always, let me know if you have any comments/questions/concerns, and feel free to connect with me on LinkedIn or email me at [email protected].

Unless specified, all images are by the author.

