Data Scientists — Stop Using A Random Seed
Or why random_state=42 is an anti-pattern
The following is my opinion and reasoning about using a random seed in your work, from model development to production, regardless of the number chosen (in many cases a well-known one, such as 42). There are use cases where a random seed is appropriate because you want consistency, and others where consistency is the wrong choice and you should avoid a seed completely. This article reviews both.
Use Cases To Avoid
Model Training
I have seen many data scientists use a random seed for training. There are a couple of main reasons not to do that, so let's get into them:
- You may forget your random seed when moving to production, and a model that relies on a single random seed may not be robust enough, because the seed can affect internal model decisions, layer initialization, and other factors.
- If you always train with the same seed, you base your decisions on results tied to that one value, which amounts to overfitting to the random seed. On the other hand, if you don't fix a random seed while building your model, you can place more trust in it, because it has been exposed to various randomization effects and will be more robust to changes in the random numbers. In other words…
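One way to make the two points above concrete is to score the same model under many seeds and look at the spread, instead of trusting one number. The sketch below is only an illustration under stated assumptions: the synthetic data, the midpoint-threshold "model", and the helper names `make_data` and `train_and_score` are all hypothetical stand-ins, and it uses only the Python standard library, not any particular ML framework.

```python
import random
import statistics


def make_data(n=200, rng=None):
    """Hypothetical 1-D, two-class dataset: class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    rng = rng or random.Random()
    data = []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are equally represented
        data.append((rng.gauss(1.5 * label, 1.0), label))
    return data


def train_and_score(data, seed):
    """Shuffle-split with `seed`, fit a midpoint threshold, return test accuracy."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)  # the seed only controls the train/test split here
    cut = int(0.7 * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]

    # "Training": the decision threshold is the midpoint between the class means.
    mean0 = statistics.fmean(x for x, y in train if y == 0)
    mean1 = statistics.fmean(x for x, y in train if y == 1)
    threshold = (mean0 + mean1) / 2

    correct = sum((x > threshold) == bool(y) for x, y in test)
    return correct / len(test)


# The dataset is generated once; it stands in for a real, fixed dataset.
data = make_data(rng=random.Random(0))

# A single fixed seed gives one number with no sense of its variability.
print(f"seed 42 only: {train_and_score(data, 42):.3f}")

# Many seeds show how much the score moves with randomization alone.
scores = [train_and_score(data, seed) for seed in range(20)]
print(
    f"20 seeds: mean={statistics.fmean(scores):.3f} "
    f"stdev={statistics.stdev(scores):.3f} "
    f"min={min(scores):.3f} max={max(scores):.3f}"
)
```

If the spread across seeds is large compared with the differences you are basing decisions on, a single-seed score is probably not telling you much about the model itself.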