How to Use Random Seeds Effectively
Best practices for an often overlooked part of the ML process
Building a predictive model is a complex process. You need to get the right data, clean it, create useful features, test different algorithms, and finally validate your model’s performance. However, this post covers an aspect of the model-building process that doesn’t typically get much attention: random seeds.
❓ What is a Random Seed?
A random seed is used to ensure that results are reproducible. In other words, using this parameter makes sure that anyone who re-runs your code will get the exact same outputs. Reproducibility is an extremely important concept in data science and other fields. Lots of people have already written about this topic at length, so I won’t discuss it any further in this post.
Depending on your specific project, you may not even need a random seed. However, there are 2 common tasks where they are used:
1. Splitting data into training/validation/test sets: random seeds ensure that the data is divided the same way every time the code is run
2. Model training: algorithms such as random forest and gradient boosting are non-deterministic (for a given input, the output is not always the same) and so require a random seed argument for reproducible results; both uses are illustrated in the short sketch after this list
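As a concrete illustration, here is a minimal sketch of where the seed goes for each task in scikit-learn. The synthetic data, seed values, and hyperparameters are arbitrary placeholders, not anything from this post's Titanic example.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Placeholder data so the sketch runs on its own
X, y = make_classification(n_samples = 500, random_state = 0)

# 1. Splitting data: random_state fixes which rows land in each split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 42)

# 2. Model training: random_state fixes the randomness inside the algorithm
clf = RandomForestClassifier(n_estimators = 100, random_state = 42).fit(X_train, y_train)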
In addition to reproducibility, random seeds are also important for benchmarking results. If you are testing multiple versions of an algorithm, it’s important that all versions use the same data and are as similar as possible (except for the parameters you are testing).
How Random Seeds Are Usually Set
Despite their importance, random seeds are often set without much effort. I’m guilty of this. I typically use the date of whatever day I’m working on (so on March 1st, 2020 I would use the seed 20200301). Some people use the same seed every time, while others randomly generate them.
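If you like the date convention, here is a tiny helper that derives a seed from a date. It is my own illustrative sketch, not part of the original workflow.

from datetime import date

# Illustrative helper: turn a date into a seed such as 20200301
def date_seed(d=None):
    d = d or date.today()
    return int(d.strftime('%Y%m%d'))

print(date_seed(date(2020, 3, 1)))  # 20200301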
Overall, random seeds are typically treated as an afterthought in the modeling process. This can be problematic because, as we’ll see in the next few sections, the choice of this parameter can significantly affect results.
🚢 Titanic Data
Now, I’ll demonstrate just how much impact the choice of a random seed can have. I’ll use the well-known Titanic dataset to do this.
The following code and plots are created in Python, but I found similar results in R. The complete code associated with this post can be found in the accompanying GitHub repository.
First, let’s look at a few rows of this data:
import pandas as pd

train_all = pd.read_csv('train.csv')

# Show selected columns
train_all.drop(['PassengerId', 'Parch', 'Ticket', 'Embarked', 'Cabin'], axis = 1).head()
The Titanic data is already divided into training and test sets. A classic task for this dataset is to predict passenger survival (encoded in the Survived column). The test data does not come with labels for the Survived column, so I’ll be doing the following:
1. Holding out part of the training data to serve as a validation set
2. Training a model to predict survival on the remaining training data and evaluating that model against the validation set created in step 1
Splitting Data
Let’s start by looking at the overall distribution of the Survived column.
In [19]: train_all.Survived.value_counts() / train_all.shape[0]
Out[19]:
0 0.616162
1 0.383838
Name: Survived, dtype: float64
When modeling, we want our training, validation, and test data to be as similar as possible so that our model is trained on the same kind of data that it’s being evaluated against. Note that this does not mean that any of these 3 data sets should overlap! They should not. But we want the observations contained in each of them to be broadly comparable. I’ll now split the data using different random seeds and compare the resulting distributions of Survived for the training and validation sets.
from sklearn.model_selection import train_test_split

# Create data frames for dependent and independent variables
X = train_all.drop('Survived', axis = 1)
y = train_all.Survived

# Split 1
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 135153)

In [41]: y_train.value_counts() / len(y_train)
Out[41]:
0 0.655899
1 0.344101
Name: Survived, dtype: float64

In [42]: y_val.value_counts() / len(y_val)
Out[42]:
0 0.458101
1 0.541899
Name: Survived, dtype: float64
In this case, the proportion of survivors is much lower in the training set than the validation set.
# Split 2
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 163035)

In [44]: y_train.value_counts() / len(y_train)
Out[44]:
0 0.577247
1 0.422753
Name: Survived, dtype: float64

In [45]: y_val.value_counts() / len(y_val)
Out[45]:
0 0.77095
1 0.22905
Name: Survived, dtype: float64
Here, the proportion of survivors is much higher in the training set than in the validation set.
Full disclosure, these examples are the most extreme ones I found after looping through 200K random seeds. Regardless, there are a couple of concerns with these results. First, in both cases, the survival distribution is substantially different between the training and validation sets. This will likely negatively affect model training. Second, these outputs are very different from each other. If, as most people do, you set a random seed arbitrarily, your resulting data splits can vary drastically depending on your choice.
I’ll discuss best practices at the end of the post. Next, I want to show how the training and validation Survived distributions varied across all 200K random seeds I tested.
~23% of data splits resulted in a survival percentage difference of at least 5% between training and validation sets. Over 1% of splits resulted in a survival percentage difference of at least 10%. The largest survival percentage difference was ~20%. The takeaway here is that using an arbitrary random seed can result in large differences between the training and validation set distributions. These differences can have unintended downstream consequences in the modeling process.
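For reference, here is a minimal sketch of the kind of seed sweep behind these numbers. The seed range below is much smaller than the 200K I tested, and the bookkeeping is my own illustrative version rather than the exact code from the repository.

import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative sketch: measure how far apart the training and validation
# survival rates land for many candidate seeds (y is the Survived column)
diffs = []
for seed in range(5000):  # assumption: a smaller range than the 200K in the post
    y_tr, y_va = train_test_split(y, test_size = 0.2, random_state = seed)
    diffs.append(abs(y_tr.mean() - y_va.mean()))

diffs = pd.Series(diffs)
print(diffs.describe())        # spread of the train/validation differences
print((diffs >= 0.05).mean())  # share of splits that differ by at least 5 points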
📈 Model Training
The previous section showed how random seeds can influence data splits. In this section, I train a model using different random seeds after the data has already been split into training and validation sets (more on exactly how I do that in the next section).
As a reminder, I’m trying to predict the Survived column. I’ll build a random forest classification model. Since the random forest algorithm is non-deterministic, a random seed is needed for reproducibility. I’ll show results for model accuracy below, but I found similar results using precision and recall.
First, I’ll create a training and validation set.
X = X[['Pclass', 'Sex', 'SibSp', 'Fare']] # These will be my predictors

# The “Sex” variable is a string and needs to be one-hot encoded
X['gender_dummy'] = pd.get_dummies(X.Sex)['female']
X = X.drop(['Sex'], axis = 1)

# Divide data into training and validation sets
# I’ll discuss exactly why I divide the data this way in the next section
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 20200226, stratify = y)
Now I’ll train a couple of models and evaluate accuracy on the validation set.
# Model 1
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create and fit model
clf = RandomForestClassifier(n_estimators = 50, random_state = 11850)
clf = clf.fit(X_train, y_train)
preds = clf.predict(X_val) # Get predictions

In [74]: round(accuracy_score(y_true = y_val, y_pred = preds), 3)
Out[74]: 0.765

# Model 2
# Create and fit model
clf = RandomForestClassifier(n_estimators = 50, random_state = 2298)
clf = clf.fit(X_train, y_train)
preds = clf.predict(X_val) # Get predictions

In [78]: round(accuracy_score(y_true = y_val, y_pred = preds), 3)
Out[78]: 0.827
I tested 25K random seeds to find these results, but a swing of more than 6 percentage points in accuracy is definitely noteworthy! Again, these 2 models are identical except for the random seed.
The plot below shows how model accuracy varied across all of the random seeds I tested.
While most models achieved roughly 80% accuracy, a substantial number scored between 79% and 82%, and a handful fell outside that range. Depending on the specific use case, these differences are large enough to matter. Therefore, model performance variance due to random seed choice should be taken into account when communicating results to stakeholders.
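If you want to reproduce this kind of sweep, here is a minimal sketch. The seed range is smaller than the 25K I tested, and the summary step is my own illustrative addition.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Illustrative sketch: fit the same model specification under many seeds
# and record validation accuracy for each one
accuracies = []
for seed in range(1000):  # assumption: a smaller range than the 25K in the post
    clf = RandomForestClassifier(n_estimators = 50, random_state = seed)
    clf.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_true = y_val, y_pred = clf.predict(X_val)))

print(pd.Series(accuracies).describe())  # seed-to-seed variation in accuracy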
Best Practices
Now that we’ve seen a few areas where the choice of random seed impacts results, I’d like to propose a few best practices.
For data splitting, I believe stratified samples should be used so that the proportions of the dependent variable (Survived in this post) are similar in the training, validation, and test sets. This eliminates the varying survival distributions seen above and allows a model to be trained and evaluated on comparable data.
The train_test_split function can implement stratified sampling with 1 additional argument. Note that if a model is later evaluated against data with a different dependent variable distribution, performance may be different than expected. However, I believe stratifying by the dependent variable is still the preferred way to split data.
Here’s how stratified sampling looks in code.
# Overall distribution of “Survived” column
In [19]: train_all.Survived.value_counts() / train_all.shape[0]
Out[19]:
0 0.616162
1 0.383838
Name: Survived, dtype: float64

# Stratified sampling (see last argument)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state = 20200226, stratify = y)

In [10]: y_train.value_counts() / len(y_train)
Out[10]:
0 0.616573
1 0.383427
Name: Survived, dtype: float64

In [11]: y_val.value_counts() / len(y_val)
Out[11]:
0 0.614525
1 0.385475
Name: Survived, dtype: float64
Using the stratify argument, the proportion of Survived is similar in the training and validation sets. I still use a random seed because I still want reproducible results. However, it’s my opinion that the specific random seed value doesn’t matter in this case.
That addresses data splitting best practices, but what about model training? While testing different model specifications, the same random seed should be used across versions so that comparisons are fair, but I don’t think the particular seed value matters much.
However, before reporting performance metrics to stakeholders, the final model should be trained and evaluated with 2–3 additional seeds to understand possible variance in results. This practice allows more accurate communication of model performance. For a critical model running in a production environment, it’s worth considering running that model with multiple seeds and averaging the result (though this is probably a topic for a separate blog post).
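As a rough sketch of what that multi-seed check could look like (the specific seeds and the summary line below are my own arbitrary examples, not a recommendation):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Illustrative sketch: re-fit the final model with a few extra seeds and
# report the spread of the metric instead of a single number
seeds = [20200226, 42, 7]
scores = []
for seed in seeds:
    clf = RandomForestClassifier(n_estimators = 50, random_state = seed)
    clf.fit(X_train, y_train)
    scores.append(accuracy_score(y_true = y_val, y_pred = clf.predict(X_val)))

print(round(np.mean(scores), 3), '+/-', round(np.std(scores), 3))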
🏁 Conclusion
Hopefully I’ve convinced you to pay a bit of attention to the often-overlooked random seed parameter. Feel free to get in touch if you’d like to see the full code used in this post or have other ideas for random seed best practices!