How to Use Random Seeds Effectively
Best practices for an often overlooked part of the ML process
Building a predictive model is a complex process. You need to get the right data, clean it, create useful features, test different algorithms, and finally validate your model’s performance. However, this post covers an aspect of the model-building process that doesn’t typically get much attention: random seeds.
❓ What is a Random Seed?
A random seed is used to ensure that results are reproducible. In other words, using this parameter makes sure that anyone who re-runs your code will get the exact same outputs. Reproducibility is an extremely important concept in data science and other fields. Lots of people have already written about this topic at length, so I won’t discuss it any further in this post.
Depending on your specific project, you may not even need a random seed. However, there are 2 common tasks where they are used:
1. Splitting data into training/validation/test sets: random seeds ensure that the data is divided the same way every time the code is run
2. Model training: algorithms such as random forest and gradient boosting are non-deterministic (for a given input, the output is not always the same) and so require a random seed argument for reproducible results; both uses are illustrated in the short sketch after this list
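As a concrete illustration, here is a minimal sketch of where the seed goes for each task in scikit-learn. The synthetic data, seed values, and hyperparameters are arbitrary placeholders, not anything from this post's Titanic example.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Placeholder data so the sketch runs on its own
X, y = make_classification(n_samples = 500, random_state = 0)

# 1. Splitting data: random_state fixes which rows land in each split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42)

# 2. Model training: random_state fixes the randomness inside the algorithm
clf = RandomForestClassifier(n_estimators = 100, random_state = 42).fit(X_train, y_train)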
In addition to reproducibility, random seeds are also important for benchmarking results. If you are testing multiple versions of an algorithm, it’s important that all versions use the same data and are as similar as possible (except for the parameters you are testing).
How Random Seeds Are Usually Set
Despite their importance, random seeds are often set without much effort. I’m guilty of this. I typically use the date of whatever day I’m working on (so on March 1st, 2020 I would use the seed 20200301). Some people use the same seed every time, while others randomly generate them.
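If you like the date convention, here is a tiny helper that derives a seed from a date. It is my own illustrative sketch, not part of the original workflow.

from datetime import date

# Illustrative helper: turn a date into a seed such as 20200301
def date_seed(d=None):
    d = d or date.today()
    return int(d.strftime('%Y%m%d'))

print(date_seed(date(2020, 3, 1)))  # 20200301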
Overall, random seeds are typically treated as an afterthought in the modeling process. This can be problematic because, as we’ll see in the next few sections, the choice of this parameter can significantly affect results.
🚢 Titanic Data
Now, I’ll demonstrate just how much impact the choice of a random seed can have. I’ll use the well-known Titanic dataset to do this.
The following code and plots are created in Python, but I found similar results in R. The complete code associated with this post can be found in the accompanying GitHub repository.
First, let’s look at a few rows of this data:
import pandas as pd

train_all = pd.read_csv('train.csv')

# Show selected columns
train_all.drop(['PassengerId', 'Parch', 'Ticket', 'Embarked', 'Cabin'], axis = 1).head()
The Titanic data is already divided into training and test sets. A classic task for this dataset is to predict passenger survival (encoded in the Survived column). The test data does not come with labels for the Survived column, so I’ll be doing the following:
1. Holding out part of the training data to serve as a validation set
2. Training a model to predict survival on the remaining training data and evaluating that model against the validation set created in step 1
Splitting Data
Let’s start by looking at the overall distribution of the Survived column.
In [19]: train_all.Survived.value_counts() / train_all.shape[0]
Out[19]:
0 0.616162
1 0.383838
Name: Survived, dtype: float64
When modeling, we want our training, validation, and test data to be as similar as possible so that our model is trained on the same kind of data that it’s being evaluated against. Note that this does not mean that any of these 3 data sets should overlap! They should not. But we want the observations contained in each of them to be broadly comparable. I’ll now split the data using different random seeds and compare the resulting distributions of Survived for the training and validation sets.
from sklearn.model_selection import train_test_split

# Create data frames for dependent and independent variables
X = train_all.drop('Survived', axis = 1)
y = train_all.Survived

# Split 1
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 135153)

In [41]: y_train.value_counts() / len(y_train)
Out[41]:
0 0.655899
1 0.344101
Name: Survived, dtype: float64

In [42]: y_val.value_counts() / len(y_val)
Out[42]:
0 0.458101
1 0.541899
Name: Survived, dtype: float64
In this case, the proportion of survivors is much lower in the training set than the validation set.
# Split 2
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 163035)

In [44]: y_train.value_counts() / len(y_train)
Out[44]:
0 0.577247
1 0.422753
Name: Survived, dtype: float64

In [45]: y_val.value_counts() / len(y_val)
Out[45]:
0 0.77095
1 0.22905
Name: Survived, dtype: float64
Here, the proportion of survivors is much higher in the training set than in the validation set.
Full disclosure, these examples are the most extreme ones I found after looping through 200K random seeds. Regardless, there are a couple of concerns with these results. First, in both cases, the survival distribution is substantially different between the training and validation sets. This will likely negatively affect model training. Second, these outputs are very different from each other. If, as most people do, you set a random seed arbitrarily, your resulting data splits can vary drastically depending on your choice.
I’ll discuss best practices at the end of the post. Next, I want to show how the training and validation Survived distributions varied across all 200K random seeds I tested.
~23% of data splits resulted in a survival percentage difference of at least 5% between training and validation sets. Over 1% of splits resulted in a survival percentage difference of at least 10%. The largest survival percentage difference was ~20%. The takeaway here is that using an arbitrary random seed can result in large differences between the training and validation set distributions. These differences can have unintended downstream consequences in the modeling process.
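For reference, here is a minimal sketch of the kind of seed sweep behind these numbers. The seed range below is much smaller than the 200K I tested, and the bookkeeping is my own illustrative version rather than the exact code from the repository.

import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative sketch: measure how far apart the training and validation
# survival rates land for many candidate seeds (y is the Survived column)
diffs = []
for seed in range(5000):  # assumption: a smaller range than the 200K in the post
    y_tr, y_va = train_test_split(y, test_size = 0.2, random_state = seed)
    diffs.append(abs(y_tr.mean() - y_va.mean()))

diffs = pd.Series(diffs)
print(diffs.describe())        # spread of the train/validation differences
print((diffs >= 0.05).mean())  # share of splits that differ by at least 5 points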
📈 Model Training
The previous section showed how random seeds can influence data splits. In this section, I train a model using different random seeds after the data has already been split into training and validation sets (more on exactly how I do that in the next section).
As a reminder, I’m trying to predict the Survived column. I’ll build a random forest classification model. Since the random forest algorithm is non-deterministic, a random seed is needed for reproducibility. I’ll show results for model accuracy below, but I found similar results using precision and recall.
First, I’ll create a training and validation set.
X = X[['Pclass', 'Sex', 'SibSp', 'Fare']] # These will be my predictors

# The “Sex” variable is a string and needs to be one-hot encoded
X['gender_dummy'] = pd.get_dummies(X.Sex)['female']
X = X.drop(['Sex'], axis = 1)

# Divide data into training and validation sets
# I’ll discuss exactly why I divide the data this way in the next section
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 20200226, stratify = y)
Now I’ll train a couple of models and evaluate accuracy on the validation set.
# Model 1
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create and fit model
clf = RandomForestClassifier(n_estimators = 50, random_state = 11850)
clf = clf.fit(X_train, y_train)
preds = clf.predict(X_val) # Get predictions

In [74]: round(accuracy_score(y_true = y_val, y_pred = preds), 3)
Out[74]: 0.765

# Model 2
# Create and fit model
clf = RandomForestClassifier(n_estimators = 50, random_state = 2298)
clf = clf.fit(X_train, y_train)
preds = clf.predict(X_val) # Get predictions

In [78]: round(accuracy_score(y_true = y_val, y_pred = preds), 3)
Out[78]: 0.827
I tested 25K random seeds to find these results, but a swing of more than 6 percentage points in accuracy is definitely noteworthy! Again, these 2 models are identical except for the random seed.
The plot below shows how model accuracy varied across all of the random seeds I tested.
While most models achieved roughly 80% accuracy, a substantial number scored between 79% and 82%, and a handful fell outside that range. Depending on the specific use case, these differences are large enough to matter. Therefore, model performance variance due to random seed choice should be taken into account when communicating results to stakeholders.
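If you want to reproduce this kind of sweep, here is a minimal sketch. The seed range is smaller than the 25K I tested, and the summary step is my own illustrative addition.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Illustrative sketch: fit the same model specification under many seeds
# and record validation accuracy for each one
accuracies = []
for seed in range(1000):  # assumption: a smaller range than the 25K in the post
    clf = RandomForestClassifier(n_estimators = 50, random_state = seed)
    clf.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_true = y_val, y_pred = clf.predict(X_val)))

print(pd.Series(accuracies).describe())  # seed-to-seed variation in accuracy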
Best Practices
Now that we’ve seen a few areas where the choice of random seed impacts results, I’d like to propose a few best practices.
For data splitting, I believe stratified samples should be used so that the proportions of the dependent variable (Survived in this post) are similar in the training, validation, and test sets. This eliminates the varying survival distributions seen above and allows a model to be trained and evaluated on comparable data.
The train_test_split function can implement stratified sampling with 1 additional argument. Note that if a model is later evaluated against data with a different dependent variable distribution, performance may be different than expected. However, I believe stratifying by the dependent variable is still the preferred way to split data.
Here’s how stratified sampling looks in code.
# Overall distribution of “Survived” column
In [19]: train_all.Survived.value_counts() / train_all.shape[0]
Out[19]:
0 0.616162
1 0.383838
Name: Survived, dtype: float64

# Stratified sampling (see last argument)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 20200226, stratify = y)

In [10]: y_train.value_counts() / len(y_train)
Out[10]:
0 0.616573
1 0.383427
Name: Survived, dtype: float64

In [11]: y_val.value_counts() / len(y_val)
Out[11]:
0 0.614525
1 0.385475
Name: Survived, dtype: float64
Using the stratify argument, the proportion of Survived is similar in the training and validation sets. I still use a random seed because I still want reproducible results. However, it’s my opinion that the specific random seed value doesn’t matter in this case.
That addresses data splitting best practices, but what about model training? While testing different model specifications, the same random seed should be used across versions so that comparisons are fair, but I don’t think the particular seed value matters much.
However, before reporting performance metrics to stakeholders, the final model should be trained and evaluated with 2–3 additional seeds to understand possible variance in results. This practice allows more accurate communication of model performance. For a critical model running in a production environment, it’s worth considering running that model with multiple seeds and averaging the result (though this is probably a topic for a separate blog post).
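As a rough sketch of what that multi-seed check could look like (the specific seeds and the summary line below are my own arbitrary examples, not a recommendation):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Illustrative sketch: re-fit the final model with a few extra seeds and
# report the spread of the metric instead of a single number
seeds = [20200226, 42, 7]
scores = []
for seed in seeds:
    clf = RandomForestClassifier(n_estimators = 50, random_state = seed)
    clf.fit(X_train, y_train)
    scores.append(accuracy_score(y_true = y_val, y_pred = clf.predict(X_val)))

print(round(np.mean(scores), 3), '+/-', round(np.std(scores), 3))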
🏁 Conclusion
Hopefully I’ve convinced you to pay a bit of attention to the often-overlooked random seed parameter. Feel free to get in touch if you’d like to see the full code used in this post or have other ideas for random seed best practices!