Data Scientists — Stop Using A Random Seed

Or why random_state=42 is an anti-pattern

Dr. Ori Cohen
Towards Data Science

Image: Systematic Randomness (gears working together as a team), loginueve ilustra, Pixabay

The following is my opinion and reasoning about using a random seed in your work, both during model development and in production, regardless of the number chosen (which in many cases is a well-known number such as 42). There are certain use cases where a random seed is appropriate if you want to maintain consistency, and others you should avoid entirely, because there consistency is the wrong choice. This article reviews both.

Use Cases To Avoid

Model Training

I have seen many data scientists use a random seed for training. There are a couple of main reasons not to do that, so let’s get into them:

  1. You may forget your random seeds when moving to production, and your model may not be robust enough if it relies on a single random seed, which can affect internal model decisions, layer initialization, and other factors.
  2. You will always train using the same seed and base your decisions on results tied to that one value, which amounts to random-seed overfitting. On the other hand, if you don't fix a random seed while building your model, you will build more trust in it, because it has been exposed to various randomization effects and will be more robust to changes in the random numbers (see the sketch after this list). In other words…
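To make the second point concrete, here is a minimal sketch of the idea, assuming scikit-learn and using a RandomForestClassifier and a toy dataset as stand-ins for whatever model and data you actually train on (none of these specifics are from the article): instead of pinning random_state=42, train and evaluate under several different seeds and look at the spread of scores.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy dataset; in practice this would be your own training data.
X, y = make_classification(n_samples=2000, n_features=20)

rng = np.random.default_rng()
scores = []
for seed in rng.integers(0, 2**32 - 1, size=10):  # ten arbitrary seeds
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=int(seed)
    )
    model = RandomForestClassifier(random_state=int(seed))
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

# A wide spread here suggests that decisions based on one pinned seed would not hold up.
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f} over {len(scores)} seeds")

Reporting the mean and spread across seeds, rather than a single number from one fixed seed, is what builds the trust in the model described above.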
