
Time Based Cross Validation

What happens when our data is not a time series, but still has a time dimension that is very important? This is a Python solution for time-based cross-validation, with all the required inputs and an output that matches scikit-learn's methods.

Or Herman-Saffar
Towards Data Science · Jan 20, 2020


Training and evaluating machine learning models usually requires a training set and a test set. In most cases, the train/test split is done randomly, taking 20% of the data as test data that the model never sees, and using the rest for training.
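For reference, this standard random split is a one-liner in scikit-learn; the toy data below is only for illustration:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(100).reshape(50, 2)  # toy feature matrix
    y = np.arange(50)                  # toy targets

    # Hold out 20% of the records as an unseen test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )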

When dealing with time-related, dynamically changing environments, where the characteristics of the environment change over time, it is best to use time-based splitting, which provides a statistically robust model evaluation and best simulates real-life scenarios. For this we should use time-based cross-validation, a method taken from the time-series field, which forms a type of “sliding window” training approach.

Time-based cross-validation approach

This approach is well known in the time-series domain, where we have a signal: a sequence taken at successive, equally spaced points in time.

But what happens when our data is not a time series, but still has a time dimension that is very important?

Example of a problem that requires time-based cross-validation

We would like to predict the delivery time of an order. Each record is an order, represented by a set of features, and together the records form a data table. We know when each order happened, and several orders could be placed on the same date. A detailed explanation of such a problem can be found in my previous blog post. In this case, our intention was to train a new model based on the last month's orders, then apply it to predict the delivery time of next week's orders.
In order to best mimic the real world, we should train our models on data taken from a one-month period, then test them on new data captured during the following week. To create robust and general models, we should use several splitting points in time and apply time-based cross-validation. Our final test result would be the weighted average over all test windows, as sketched below.
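To make the evaluation step concrete, here is a minimal sketch of that weighted average, assuming hypothetical per-window test scores:

    import numpy as np

    # Hypothetical (test-set size, test score) pairs, one per test window
    window_results = [(120, 0.81), (95, 0.78), (110, 0.84)]

    sizes = np.array([n for n, _ in window_results])
    scores = np.array([s for _, s in window_results])

    # Final result: average of the window scores, weighted by test-set size
    final_score = np.average(scores, weights=sizes)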

We need to pay attention to 3 important aspects:

1. Time-based train/test split: in each split, the test indices must be higher than those that came before, so the model is always evaluated on data from after its training period.

2. We would like to choose our train/test set sizes so as to mimic real-world scenarios, in which we train a model over some period and then apply it to the upcoming period. For example: train the model on the last month's data and apply it to predict on the upcoming week's data.

3. Dates matter. For our purposes, the number of records in each set does not matter; what matters is the size of the windows in terms of days. We would like to split the data so that each window consists of data from X days (a single such split is sketched below).
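To make point 3 concrete, here is a minimal pandas sketch of a single day-based split; the column names and dates are illustrative:

    import pandas as pd

    df = pd.DataFrame({
        "record_date": pd.to_datetime(
            ["2019-12-05", "2019-12-20", "2020-01-03", "2020-01-08"]),
        "delivery_time": [2.5, 3.0, 1.5, 4.0],
    })

    split_date = pd.Timestamp("2020-01-01")
    train_window = pd.Timedelta(days=30)  # train on the preceding month
    test_window = pd.Timedelta(days=7)    # test on the following week

    # The windows are defined in days, not in numbers of records
    train_mask = ((df["record_date"] >= split_date - train_window)
                  & (df["record_date"] < split_date))
    test_mask = ((df["record_date"] >= split_date)
                 & (df["record_date"] < split_date + test_window))

    train_set, test_set = df[train_mask], df[test_mask]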

Other partial solutions

Scikit-learn has a TimeSeriesSplit method, but it has several drawbacks. Assuming our data is sorted by time, this method splits it into train/test sets in an expanding-window fashion (the training set grows with each split), but it does not let us choose the set sizes; we can only choose how many splits we would like to have. Scikit-learn's TimeSeriesSplit also effectively assumes that there is one observation per date, and therefore does not address points 2 and 3 above.
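For illustration, TimeSeriesSplit only exposes the number of splits; the set sizes fall out of the number of records:

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(20).reshape(10, 2)  # data assumed to be sorted by time

    tscv = TimeSeriesSplit(n_splits=3)  # we choose the number of splits only
    for train_index, test_index in tscv.split(X):
        print("TRAIN:", train_index, "TEST:", test_index)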

Another solution is the one suggested by Germayne and presented in his blog. He explains the entire approach, and how it differs from the scikit-learn method, very well. This solution takes the train set size (called initial) and the test set size (called horizon) as inputs, but it creates sets containing a fixed number of records. For our purpose, we need sets containing a fixed number of days, so this solution does not address point 3.

Suggested solution

Therefore, I have written my own solution for time-based train and test splitting, one that not only lets us choose the relevant set sizes, but also addresses the significant aspect of measuring the window size in days (rather than in records).

The returned CV splits work like those of any other scikit-learn cross-validator and can be used with any scikit-learn method that accepts an iterable of train/test splits.

Note that your data frame must have a column that contains the date of each record, since this solution relies on those dates.

Parameters:

  • train_period: int, default=30
    Number of time units to include in each train set.
  • test_period: int, default=7
    Number of time units to include in each test set.
  • freq: string, default='days'
    Frequency of the input parameters. Possible values are: days, months, years, weeks, hours, minutes, seconds.

Methods:

get_n_splits(self)
Returns the number of splitting iterations in the cross-validator

split(self, data, validation_split_date=None, date_column='record_date', gap=0)
Returns a list of (train_index, test_index) tuples, similar to scikit-learn cross-validators.

  • data: pandas DataFrame
    Your data, containing one column that indicates each record's date
  • validation_split_date: datetime.date
    First date to perform the splitting on. This is the date when the first test set starts.
  • date_column: string
    Name of the column that holds each record's date
  • gap: int
    For cases where the test set does not come right after the train set, *gap* days are left between the train and test sets
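Here is a minimal sketch of how such a class can be implemented against the interface above. It is an outline rather than the full implementation: it omits input validation, and it assumes the windows slide forward by one test_period at a time, with gap interpreted in the same freq units as the other parameters:

    import pandas as pd
    from dateutil.relativedelta import relativedelta

    class TimeBasedCV:
        """Sliding-window cross-validator whose train and test windows
        are defined in time units (e.g. days), not numbers of records."""

        def __init__(self, train_period=30, test_period=7, freq='days'):
            self.train_period = train_period
            self.test_period = test_period
            self.freq = freq  # 'days', 'months', 'years', 'weeks', ...
            self.n_splits = None  # set by split()

        def split(self, data, validation_split_date=None,
                  date_column='record_date', gap=0):
            # relativedelta accepts the freq values as keyword arguments
            train_delta = relativedelta(**{self.freq: self.train_period})
            test_delta = relativedelta(**{self.freq: self.test_period})
            gap_delta = relativedelta(**{self.freq: gap})

            dates = pd.to_datetime(data[date_column])
            if validation_split_date is None:
                # Default: the first test window starts one train_period
                # after the earliest record
                validation_split_date = (dates.min() + train_delta).date()

            splits = []
            test_start = pd.Timestamp(validation_split_date)
            # Slide forward by test_period until the data runs out
            while test_start + test_delta <= dates.max():
                train_end = test_start - gap_delta
                train_start = train_end - train_delta
                train_indices = data[
                    (dates >= train_start) & (dates < train_end)].index
                test_indices = data[
                    (dates >= test_start)
                    & (dates < test_start + test_delta)].index
                splits.append((train_indices, test_indices))
                test_start = test_start + test_delta

            self.n_splits = len(splits)
            return splits

        def get_n_splits(self):
            # Number of splitting iterations from the last call to split()
            return self.n_splits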

Examples: how to use
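Here is an illustrative end-to-end sketch using the class outline above; the data frame and the plain LinearRegression model are toy stand-ins for a real dataset and model:

    import datetime

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error

    # Toy orders data; in practice, use your own data frame
    data = pd.DataFrame({
        "record_date": pd.date_range("2019-11-01", periods=120, freq="D"),
        "feature": range(120),
        "delivery_time": [day % 7 + 1 for day in range(120)],
    })

    tscv = TimeBasedCV(train_period=30, test_period=7, freq='days')
    splits = tscv.split(data,
                        validation_split_date=datetime.date(2019, 12, 15),
                        date_column='record_date')
    print(tscv.get_n_splits())  # number of train/test windows

    scores, sizes = [], []
    for train_index, test_index in splits:
        train, test = data.loc[train_index], data.loc[test_index]
        model = LinearRegression().fit(train[["feature"]],
                                       train["delivery_time"])
        preds = model.predict(test[["feature"]])
        scores.append(mean_absolute_error(test["delivery_time"], preds))
        sizes.append(len(test))

    # Final result: weighted average over all test windows
    final_mae = np.average(scores, weights=sizes)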

Closing

The suggested solution has worked very well on various problems I have encountered in the past. Please let me know how it works out for you, and whether you think more adjustments could be made to improve this class.

Special thanks to Noga Gershon who shared with me her experience with related challenges.
Thanks to Noga and Idan Richman-Goshen for their awesome technical feedback and for proofreading and reviewing this article.
