
Why do we set a random state in machine learning models?

random_state=42

Image by aalmeidah from Pixabay

You may already use a random state in your machine learning models. But did you know that the random state is a model hyperparameter used to control the randomness involved in machine learning models?

In Scikit-learn, the random state hyperparameter is denoted by random_state. It usually takes one of the following values.

  • None: This is the default value. It allows the function to use the global random state instance from np.random. If you call the same function multiple times with random_state=None, it will produce different results across different executions.
  • int: We can use an integer for random_state. Any non-negative integer works, including 0; negative integers are not allowed. The most popular choices are 0 and 42. When we use an integer for random_state, the function produces the same results across different executions. The results change only if we change the integer value (see the small example after this list).
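For example, here is a minimal sketch using Scikit-learn's shuffle() utility, which accepts random_state in the same way:

from sklearn.utils import shuffle

data = list(range(10))

# random_state=None (the default): a different order on each execution
print(shuffle(data))

# random_state set to an integer: the same order on every execution
print(shuffle(data, random_state=42))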

Using the random state – A classic example

Let’s see how random state works with an example. For this, we use Scikit-learn’s train_test_split() function and LinearRegression() function. The train_test_split() function is used to split the dataset into train and test sets. By default, the function shuffles the data (with shuffle=True) before splitting. The random state hyperparameter in the train_test_split() function controls the shuffling process.

With random_state=None, we get different train and test sets across different executions, and the shuffling process is not reproducible.

With random_state=0, we get the same train and test sets across different executions. With random_state=42, we also get the same train and test sets across different executions, but this time the train and test sets are different from those we got with random_state=0.

The train and test sets directly affect the model’s performance score. Because we get different train and test sets with different integer values for random_state in the train_test_split() function, the value of the random state hyperparameter indirectly affects the model’s performance score.

Now, see the following code.
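Here is a minimal sketch of such an experiment, assuming Scikit-learn's built-in diabetes dataset and an 80/20 split (the exact RMSE values depend on the dataset used).

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load a regression dataset (assumed here for illustration)
X, y = load_diabetes(return_X_y=True)

# random_state controls how the data is shuffled before the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.2f}")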

Now, we try integer values 0, 35 and 42 for random_state and re-execute the above code three times. We’ll get the following results.

  • For random_state=0, we get an RMSE of 909.81.
  • For random_state=35, we get an RMSE of 794.15.
  • For random_state=42, we get an RMSE of 824.33.

We get three significantly different RMSE values for the model depending on the integer value used in the random state hyperparameter.

Now, there is a clear question: which value do we accept as the correct RMSE? We won't accept any single value. Instead, we take the average of these RMSE values. It is better to re-execute the above code as many times as possible (e.g. 10 times) and average the RMSE values. Doing this manually is tedious. Instead, we can automate it with Scikit-learn's cross_val_score() function.
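A minimal sketch of this approach, again assuming the diabetes dataset, could look like this:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# 10-fold cross-validation; the scorer returns negative MSE,
# so negate it before taking the square root
scores = cross_val_score(
    LinearRegression(), X, y, cv=10, scoring="neg_mean_squared_error"
)
rmse_scores = np.sqrt(-scores)
print(f"Average RMSE: {rmse_scores.mean():.2f}")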

Cross-validation results (Image by author)

Other popular machine learning algorithms in which a random state is used

The train_test_split() function is a classic example in which a random state is used. In addition to that, the following machine learning algorithms include the random state hyperparameter (a short sketch of how it is set follows the list).

  • KMeans(): The random_state in the KMeans algorithm controls the random number generation for centroid initialization. For more details, read this article written by me.
  • RandomizedSearchCV(): The random_state in the RandomizedSearchCV function controls the randomization of getting the sample of hyperparameter combinations across different executions. For more details, read this article written by me.
  • DecisionTreeRegressor() or DecisionTreeClassifier(): The random_state in these algorithms controls the randomness involved when searching for the best feature during the node splitting process. It can affect the final tree structure.
  • RandomForestRegressor() or RandomForestClassifier(): The random_state in these algorithms controls two randomized processes: bootstrapping of the samples when creating trees and selecting a random subset of features to search for the best feature during the node splitting process of each tree. For more details, read this article written by me.
  • EllipticEnvelope(): The random_state in the EllipticEnvelope function determines the random number generator for shuffling the data. For more details, read this article written by me.
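As a quick sketch, the hyperparameter is set in the same way for each of these estimators (the other hyperparameter values below are just placeholders):

from sklearn.cluster import KMeans
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

kmeans = KMeans(n_clusters=3, random_state=42)    # centroid initialization
tree = DecisionTreeClassifier(random_state=42)    # feature search during node splitting
forest = RandomForestClassifier(random_state=42)  # bootstrapping + random feature subsets
envelope = EllipticEnvelope(random_state=42)      # shuffling the data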

When to use a random state

The following points answer the question "Why do we set a random state in machine learning models?" – today's topic.

We generally use a random state in machine learning models for the following reasons.

  • Consistency: Sometimes, we need consistent results across different executions of the models. When I write Data Science tutorials, I always set an integer value for the random state in machine learning models. This is because I need to get the same results when running the model at different times and I want you to get the same results when you try out my code.
  • Experimental purposes: Sometimes, we tune our models manually. In those cases, we want to keep all other hyperparameters including the random state constant except the one(s) we’re tuning. For that purpose, we can set an integer for the random state in machine learning models.
  • Increase model performance: Sometimes, you can get a significant performance improvement for your model by running it multiple times with different random states. This is because random_state is also a hyperparameter, and we can tune it to get better results (see the small sketch after this list). This is very useful when writing tutorials and in machine learning competitions. But it is not recommended for production environments or other similar practical scenarios, where a small change in the accuracy score can severely affect the end result.
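For instance, here is a small sketch (again assuming the diabetes dataset) that compares the same model under a few candidate random states:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Evaluate the same model under several random states to see how much the score moves
for state in (0, 35, 42):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=state
    )
    lr = LinearRegression().fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, lr.predict(X_test)))
    print(f"random_state={state}: RMSE = {rmse:.2f}")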

When not to use a random state

This is a tricky question. When you don't specify an integer for the random state hyperparameter, the default value, None, applies behind the scenes. None allows the function to use the global random state instance from np.random. This is also a single random state that the model relies on!

In other words, not specifying a random state does not mean that the function is not using one. It simply uses the global random state instance defined by NumPy.
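Here is a small sketch of this behaviour: seeding NumPy's global generator makes a call with random_state=None reproducible, which shows that the "unspecified" random state is still a real one.

import numpy as np
from sklearn.utils import shuffle

np.random.seed(0)

# random_state is None here, so the result comes from NumPy's global
# random state, which we just seeded above
print(shuffle(list(range(5))))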

This is much more dangerous and should be avoided in real-life scenarios such as production environments, medical fields, etc. As a solution, we can average the results by doing cross-validation, as I discussed earlier.

Summary

Different types of randomization tasks are involved in machine learning models and other related functions. Randomization takes place when splitting a dataset, when splitting a node in a decision tree or a random forest, and when initializing centroids in clustering. The random state hyperparameter is used to control any such randomness involved in machine learning models so that we get consistent results.

We can use cross-validation to mitigate the effect of randomness involved in machine learning models.

The random state hyperparameter gives direct control over the different types of randomness in different functions, as in the following example.

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# X (features) and y (target) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
rf = RandomForestRegressor(random_state=42)

Even though we've used the same integer for both random states, the randomness of each function is controlled individually. One does not affect the other.


This is the end of today’s post.

Please let me know if you’ve any questions or feedback.

I hope you enjoyed reading this article. If you’d like to support me as a writer, kindly consider signing up for a membership to get unlimited access to Medium. It only costs $5 per month and I will receive a portion of your membership fee.

Join Medium with my referral link – Rukshan Pramoditha

Thank you so much for your continuous support! See you in the next article. Happy learning to everyone!


Read next (recommended) – Written by me!

Learn how to do cross-validation with examples.

k-fold cross-validation explained in plain English

See how random forests work behind the scenes and the two types of randomness involved in random forests.

Random forests – An ensemble of decision trees

Learn the difference between a parameter and a hyperparameter.

Parameters Vs Hyperparameters: What is the difference?


Special credit goes to aalmeidah on Pixabay, who provided the nice cover image for this post. Kindly note that I slightly edited the image.

Rukshan Pramoditha 2022–04–30

