
Future-Proof Your Data Partitions

Data Partitioning and The Ghost in the Machine

Photo by Ijaz Rafi on Unsplash

Splitting data into training and test partitions is an essential step toward improving our predictions. Modeling on some of the data and testing that model via prediction on the remaining samples is how we can understand and compensate for bias & variance, a central dilemma of Machine Learning.

Splitting data in Python is very easy with the train_test_split function in scikit-learn. Beyond specifying the size of the training (or test) sample and, for classification problems, whether to stratify on the response variable, little else is required. All that is left is to set the random_state value, which determines which rows are randomly assigned to training and which to test.
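A minimal sketch of the call, assuming a feature matrix X and response y are already loaded:

from sklearn.model_selection import train_test_split

# Hold out 30% for testing; stratify keeps class proportions similar in both partitions
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)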

In the past, for random seed values I have used:

  1. 42 (an homage to Douglas Adams)
  2. 451 (Ray Bradbury)
  3. 2001, 2010, and 2061 (Arthur C. Clarke)
  4. 314159 (only on Pi Day)
  5. My most common random seed = 1

The idea is that splitting the data produces a similar distribution of values in training and test regardless of the random seed, but I’ve been doing this wrong for quite some time.

There is a ghost in our machine (learning), and it’s called Variance.

Bias & Variance

But before we strap on our Ghostbuster Proton Packs to deal with this specter, let us visit the mathematical underpinnings of bias and variance to discover a path forward. The expected test error at a given value of a random variable X, typically expressed as mean squared error, can be decomposed into three elements: 1) the variance of the estimate, 2) the squared bias, and 3) the variance of the residuals, i.e., the irreducible error (James, Witten, Hastie, & Tibshirani, 2013). See Figure 1 for more details.

Figure 1: Mean Squared Error Decomposition
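Written out, the standard decomposition (James et al., 2013) for a test point x_0 is:

E\left[\big(y_0 - \hat{f}(x_0)\big)^2\right] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)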

Variance, itself a random function, adds a degree of uncertainty to every predictive model regardless of learning algorithm, and we can use that randomness to construct more robust models. To take that leap, let’s examine a different equation that explains variance (Abu-Mostafa, Magdon-Ismail, & Lin, 2012). See Figure 2 for more details:

Figure 2: Variance Redux

In this conceptualization, the variance at a value of the random variable X is the expected squared difference, across an infinite number of datasets acted upon by learning model ‘g’, between the estimated prediction and the mean of all such estimates, and herein lies our solution!
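In the notation of Abu-Mostafa et al. (2012), with g^{(D)} the hypothesis learned from dataset D and \bar{g} the average hypothesis, this reads:

\bar{g}(x) = \mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x)\big], \qquad \mathrm{var}(x) = \mathbb{E}_{\mathcal{D}}\Big[\big(g^{(\mathcal{D})}(x) - \bar{g}(x)\big)^2\Big]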

If we could split our dataset on the mean of the variance, which is both data- and model-specific, we could increase the robustness of the model toward the deviations that data-from-the-future will certainly contain. Training and test partitions that resemble the mean of the variance could reduce the overall effect of variance itself.

Historically this has been a conceptual tool, because it implies gathering expected residuals by re-estimating the learning function over an infinite number of datasets, for every random X. There is another way to use the concept, however: simulate many datasets that are random deviations of the same dataset. By using the ratio of test accuracy to training accuracy as the metric and iterating the random seed value of the train/test split, we can visualize the instabilities of the learning model (Abu-Mostafa et al., 2012) as they relate to a specific dataset; this is a local way to inspect variance. What was a purely theoretical concept becomes an important new step in the template for business analytic modeling.

Finding the Variance Mean

Variance is model-specific and contains both reducible and irreducible error, so any attempt to visualize this random function should use the intended modeling algorithm. Figure 3 shows the variance plot and its mean for 200 iterations over the IBM HR Analytics Employee Attrition & Performance binary classification dataset with a cross-validated logistic regression model; partitioning on the mean reduces the amplitude swings of the variance and can diminish the scope of time-over-time deviations. Data preprocessing was minimal (drop low-information variables and one-hot encode categorical variables), and tuning is not required: we are not seeking accuracy but rather the instability that exists in all predictive models, so default hyperparameter settings are fine.
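For reference, a minimal preprocessing pass might look like the sketch below; the file and column names assume the Kaggle version of the dataset and are otherwise illustrative:

import pandas as pd

# Kaggle filename for the IBM attrition dataset (adjust to your local copy)
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')

# Drop low-information columns: constants and row identifiers
df = df.drop(columns=['EmployeeCount', 'Over18', 'StandardHours', 'EmployeeNumber'])

# Binary response: Attrition (Yes/No) -> 1/0
y = (df.pop('Attrition') == 'Yes').astype(int)

# One-hot encode the remaining categorical variables
X = pd.get_dummies(df, drop_first=True)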

Figure 3: Variance Plot and Mean Calculation

Code Blocks 1 and 2 show how this can be accomplished for binary classification with f1_score as the accuracy metric. The use of 200 models is arbitrary; it balances runtime against capturing the full range of variance, whose largest spike occurred just above random_state = 125 for this combination of dataset and model.

# Code Block 1: modeling algorithm and test size
from sklearn.linear_model import LogisticRegressionCV

# Cross-validated logistic regression with default hyperparameters
model = LogisticRegressionCV(random_state=1, max_iter=5000)
size = 0.5  # fraction of rows held out for the test partition

# Code Block 2: gather variance across random seeds and find its mean
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt

Var = []
for i in range(1, 201):  # 200 random seeds
    train_X, test_X, train_y, test_y = train_test_split(
        X, y, test_size=size, stratify=y, random_state=i)
    model.fit(train_X, train_y)
    train_score = f1_score(train_y, model.predict(train_X))
    test_score = f1_score(test_y, model.predict(test_X))
    Var.append(test_score / train_score)  # test/train accuracy ratio

rs_value = np.average(Var)  # the variance mean

def find_nearest(array, value):
    # Return the element of array closest to value
    array = np.asarray(array)
    idx = (np.abs(array - value)).argmin()
    return array[idx]

nearest = find_nearest(Var, rs_value)
# Seeds started at 1, so the seed is the zero-based list index plus one
print('random_state = ', Var.index(nearest) + 1)
Random seed value for variance-mean partitioning
# Code Block 3: plot the variance and its mean
plt.figure(figsize=(10, 8))
plt.plot(range(1, 201), Var, color='royalblue', label='Variance')
plt.plot([0, 200], [rs_value, rs_value], color='darkorange', linewidth=3)  # variance mean
plt.text(0, rs_value, round(rs_value, 4), fontsize=15, color='black')
plt.legend(loc='lower center', shadow=True, fontsize='medium')
plt.show()

Changing this code for regression models is a simple conversion: remove ‘stratify=y’ from the data partition and change the error measure (e.g., mean squared error, mean absolute error, etc.). Code Block 1 sets up the modeling algorithm and test size while Code Block 2 gathers variance information, computes its mean, and finds the nearest random state value to the variance mean for programming the train/test split. Code Block 3 plots the variance and its mean.
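A sketch of that regression variant, assuming a continuous response y; RidgeCV and mean absolute error here are illustrative stand-ins for whatever estimator and error measure you intend to use:

from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
import numpy as np

model = RidgeCV()
Var = []
for i in range(1, 201):
    # No stratify for a continuous response
    train_X, test_X, train_y, test_y = train_test_split(
        X, y, test_size=0.5, random_state=i)
    model.fit(train_X, train_y)
    train_error = mean_absolute_error(train_y, model.predict(train_X))
    test_error = mean_absolute_error(test_y, model.predict(test_X))
    Var.append(test_error / train_error)  # >1 means the model fares worse on test

rs_value = np.average(Var)  # variance mean, as before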

There is a question as to whether this method induces information leakage, but we are not using performance from a test prediction to alter a model fit. Rather, we are using metadata from multiple models to discover the variance mean and align our initial data partition with that value. Nonetheless, this is a topic for discussion.

In the end, splitting data on the mean of the variance, despite requiring more one-time compute resources, should result in more robust models that will retain better accuracy as new, future data arrives.

References

Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H.-T. (2012). Learning from data. AMLBook.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer.

