
Random Numbers in Machine Learning

All about pseudo-random numbers, seeding, and reproducibility

Photo by Riho Kroll on Unsplash

Machine Learning relies on statistics, and random numbers are important to the performance of many steps in the data processing and model training pipeline. Modern machine learning frameworks provide abstractions and functions that implement randomness under the hood, and for us as data scientists and machine learning engineers, the details of random number generation often remain obscure.

In this article, I want to shed some light on random numbers in machine learning. You will read about:

  • 3 examples of the use of random numbers in machine learning
  • Generating (pseudo-)random numbers
  • Fixing random numbers by seeding
  • Reproducible machine learning: necessary lines of code for scikit-learn, tensorflow, and pytorch.

By the end of this article, you will know what happens when you use random numbers in your machine learning pipeline, and you will learn the necessary lines of code to ensure reproducibility of your machine learning algorithms.


3 examples of the use of random numbers in machine learning

To illustrate the importance of random numbers, we discuss three examples where they are relevant along the machine learning pipeline.

  1. Creating train/test splits of a dataset
  2. Weight initialization in a neural network
  3. Choosing minibatches during training

Train/test split: Splitting your dataset into training and test data is one of the most important steps in evaluating the performance of a machine learning algorithm. We are interested in creating models that generalize well to data not used during training. To this end, a collection of data samples is divided into at least two disjoint sets.

The training data is used to train the algorithm, i.e. to iteratively fix the model parameters. The test data is used to validate the algorithm by applying a trained model to test data and reporting appropriate metrics.

Random split of a dataset into train and test data. Image: Author.

The popular scikit-learn function sklearn.model_selection.train_test_split uses random numbers. Under the hood, it creates an array of indices corresponding to the length of the data set. The indices are then randomly assigned to the train and test data.
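
A minimal usage example (with an illustrative toy dataset; the random_state argument anticipates the seeding discussed further below):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
y = np.arange(10)

# random_state fixes the shuffling, so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)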

Weight initialization: The layers of a neural network contain parameters that are adjusted during training. For linear layers, these parameters are weights and biases. They are initially assigned random values. Consider the example of a linear layer with 4 neurons connected to an output layer with 3 potential classes: this layer corresponds to a matrix W of 4 x 3 weights.

Three different random initializations for the 4 x 3 matrix W. Image: Author.

The initialization of the weights is crucial for the convergence of the model training. If all weights had the same value, the backpropagation algorithm would see no reason to treat each neuron differently when updating the weights.

Therefore, when the model is first instantiated, the weights are chosen randomly. Typically, the random numbers are drawn from a carefully chosen statistical distribution. Xavier Glorot and Yoshua Bengio showed in 2010 that training a feed-forward neural network with a layer of dimension n x m improves when the weights are initialized from a uniform distribution on the interval [-sqrt(6/(n + m)), sqrt(6/(n + m))] [1].
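
A short sketch of this initialization in plain numpy (the helper name glorot_uniform is ours; deep learning frameworks ship their own implementations of this scheme):

import numpy as np

def glorot_uniform(n_in, n_out, seed=None):
    # Draw weights from U[-limit, limit] with limit = sqrt(6 / (n_in + n_out))
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = glorot_uniform(4, 3, seed=0)  # a 4 x 3 weight matrix as in the example above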

Choosing training batches: In general, it is neither computationally feasible nor advisable to use the entire training dataset to update the model parameters in one go. Therefore, the training dataset is divided into minibatches of fixed size. The dataloader creates these minibatches and can randomly shuffle the data. This is to prevent data from entering training in a biased way, e.g. in a temporal order because of the way it was collected.
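
In Pytorch, for example, shuffling is a single flag on the DataLoader. A minimal sketch with a toy dataset of 12 samples, matching the figure below; the generator argument makes the shuffling reproducible:

import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.arange(12, dtype=torch.float32).unsqueeze(1)  # 12 toy samples
labels = torch.arange(12)
dataset = TensorDataset(features, labels)

# shuffle=True re-draws the sample order every epoch
g = torch.Generator().manual_seed(0)
loader = DataLoader(dataset, batch_size=4, shuffle=True, generator=g)

for batch_features, batch_labels in loader:
    print(batch_labels)  # three shuffled minibatches of size 4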

Minibatches of size 4 formed from a dataset with 12 samples. Top: Dataset is not shuffled. Bottom: Dataset is shuffled randomly. Image: Author.

Stochastic gradient descent relies on the randomness of these minibatches. By presenting random subsets of the training data to the algorithm, each backpropagation step emphasizes a slightly different aspect of the training data. This avoids getting stuck in local minima during training.


Pseudo-random numbers

Modern programming languages and machine learning frameworks provide tools for the developer to generate random numbers without worrying about the underlying algorithm. To generate a sequence of 100 uniform random numbers between 0 and 1 in Python, simply type

import numpy as np
np.random.rand(100)

But what actually happens under the hood when you execute this line of code? Enter the Random Number Generators (RNGs).

A standard computer algorithm produces predictable results, which is exactly the opposite of what we want: a random sequence of numbers. What we are looking for is an algorithm that produces sequences of numbers that are hard to predict. Note that hard does not mean impossible! So we need a source of entropy and a cryptographic algorithm. For a true random number generator, the source of entropy could come from the environment, e.g. a sensor temperature or the decay of a radioactive particle [3].

For machine learning, we do not need the same high-end randomness that would be required for a cryptographic application. Python – specifically the numpy library – implements an RNG to generate pseudo-random numbers.

In everyday language, a pseudo-random number sequence is close enough to a truly random sequence that the lack of complete randomness does not affect the purpose of the algorithm. The correlation length of the sequence must be sufficient for the application.

From an initial seed, a function f is applied to generate a new state, and another function g is applied to that state to produce a random number. This process is repeated as many times as necessary. The functions f and g should not be invertible. Also, the sequence of states must be long enough to produce many different random numbers before it repeats; this is the so-called period, or sequence length.
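
To make the idea concrete, here is a deliberately simple toy generator, a linear congruential generator; this is for illustration only and is not the algorithm numpy uses:

class ToyRNG:
    def __init__(self, seed):
        self.state = seed

    def _next_state(self):  # the function f: maps the current state to the next state
        self.state = (6364136223846793005 * self.state + 1442695040888963407) % 2**64
        return self.state

    def random(self):  # the function g: maps a state to a number in [0, 1)
        return self._next_state() / 2**64

rng = ToyRNG(seed=123)
print([round(rng.random(), 3) for _ in range(5)])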

Pseudo-random number generator following [4]. Image: Author.

The underlying algorithm used by Python’s random module and numpy’s legacy RNG is called the Mersenne Twister [4]. It provides sequences of random numbers for each state and is computationally more efficient than the simple scheme sketched above. It is not cryptographically secure, but it is fast and sufficiently powerful.

Random number distributions

Often we are interested in drawing random numbers that follow a certain statistical distribution. For example, the uniform distribution assigns equal probability to all numbers in a given interval. The normal distribution follows a Gaussian curve with a fixed mean and variance.
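
With numpy, both distributions can be sampled directly:

import numpy as np

rng = np.random.default_rng(0)
normal_samples = rng.normal(loc=0.0, scale=1.0, size=1000)   # Gaussian with mean 0, std 1
uniform_samples = rng.uniform(low=0.0, high=1.0, size=1000)  # uniform on [0, 1)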

Normal distribution (left) and uniform distribution (right) of 1000 random numbers. Image: Author.

Distributions of random numbers can be transformed into each other via their cumulative distribution function (CDF). Thus, it is sufficient to be able to draw uniform numbers; any other distribution can be generated from them. The details are beyond the scope of this article; for a start, we recommend [5].
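
As a small illustration of the idea (inverse transform sampling), exponentially distributed numbers can be generated from uniform draws via the inverse of the exponential CDF:

import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=100_000)         # uniform samples on [0, 1)
lam = 2.0                             # rate parameter of the target exponential
exponential = -np.log(1.0 - u) / lam  # inverse CDF of the exponential distribution

print(exponential.mean())  # should be close to 1 / lam = 0.5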

Fixing random numbers by seeding

As explained above, the seed is an integer that serves as input for the pseudo-random number generator.

For a fixed seed, the sequence of pseudo-random numbers generated by the presented algorithm is always the same.
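
A quick demonstration with numpy:

import numpy as np

def draw(seed):
    np.random.seed(seed)      # reset the global RNG state
    return np.random.rand(5)  # draw five uniform numbers

assert np.allclose(draw(42), draw(42))      # the same seed yields the same sequence
assert not np.allclose(draw(42), draw(43))  # a different seed yields a different sequence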

We can use this insight to achieve two different objectives.

Choose a new random seed: When you choose a new random seed, all aspects of the machine learning pipeline that rely on random numbers are initialized with different values. While the underlying distributions and dataset sizes remain unchanged, the training process can be affected.

Ideally, your training procedure should not be too dependent on the choice of random seed. You want to make sure that the training converges regardless of the initialization, and not rely on golden runs that magically produce great results without ever being reproducible.

Fix the random seed: Fixing the random seed is particularly popular for teaching purposes. When I create a simple model for a small dataset, the probability that my results depend on the random seed is quite high. Whenever I prepare teaching materials, such as a Jupyter notebook, for students, I fix the random seed so that they get the same results. This reduces confusion and allows students to focus on the new concepts rather than the less important digits of the metrics.

Clones look exactly the same. Photo by Phil Shaw on Unsplash

Reproducibility in machine learning

We have now learned that random numbers are critical to the performance of the machine learning process. However, if we want exact reproducibility of our algorithm, we need to fix the random seeds.

It is important to realize that the random seed must be fixed separately for each library that uses random numbers. For example, even if we fix the numpy random seed by np.random.seed(789123), the torch random seed will not be affected by that and training will not be reproducible.

In the following, we summarize the necessary calls to fix random seeds for different popular machine learning frameworks.

Scikit-learn uses numpy’s random number generator throughout. To fix the random seed for this framework, it is sufficient to set

import numpy as np
np.random.seed(789456123)

Scikit-learn also offers a function to access random number generators, check_random_state. To generate 1000 random numbers quickly:

from sklearn.utils.validation import check_random_state
rs = check_random_state(12345)
rs.rand(1000)

Tensorflow 2 provides its own random number generator module in tf.random. According to the documentation, fixing random seeds there ensures reproducibility; however, this is not guaranteed across different versions of tensorflow. To generate the same pseudo-random numbers, use the stateless_XXX functions from the tf.random module, e.g., for 1000 random numbers following a normal distribution:

import tensorflow as tf

shape = [1000]
seed = [12344, 0]  # stateless RNGs expect a pair of integers as the seed
tf.random.stateless_normal(shape, seed)
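
If you use the stateful ops instead, the global seed mentioned in the documentation can be set once at the start of the program (a minimal sketch; full determinism may additionally require operation-level seeds):

import tensorflow as tf

tf.random.set_seed(42)        # fixes the global seed for stateful random ops
x = tf.random.normal([1000])  # reproducible given the same program order and version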

Pytorch controls random number generation in a way that is very similar to numpy. The following will set the seed for all procedures that rely on Pytorch’s random number generator:

import torch
torch.manual_seed(9870)

However, the Pytorch documentation points out that some functions may depend on the numpy random number generator, so it is advisable to fix that seed as well.
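
A common pattern is therefore a small helper that seeds all relevant generators at once (a sketch; the helper name seed_everything is ours, and the list should be adapted to the libraries actually used in your project):

import random
import numpy as np
import torch

def seed_everything(seed=9870):
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # numpy's global RNG
    torch.manual_seed(seed)           # Pytorch CPU (and CUDA) RNGs
    torch.cuda.manual_seed_all(seed)  # explicitly seed every CUDA device

seed_everything()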

The Pytorch documentation also points out a limitation of full reproducibility. Not only is exact reproduction of random numbers not guaranteed across versions, but the numbers may also depend on the hardware: using different GPUs with different CUDA toolkits, or CPUs, can lead to different random number sequences.

Summary

Random numbers are critical to the performance of machine learning algorithms. They ensure that datasets are partitioned in an unbiased manner, improve the generalizability of the algorithm, and increase the convergence of training procedures.

Under the hood, sequences of random numbers are generated by so-called RNGs, established algorithms that are able to provide random numbers following different distributions.

The random seed can be used to fix the sequence of random numbers. This makes the machine learning algorithm reproducible, which is especially useful for teaching purposes and for academic papers. Finally, remember that if you are using a mix of machine learning frameworks in your project, you may need to set the random seed for all of them!

References

  1. Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249–256, 2010. [paper]
  2. https://numpy.org/doc/stable/reference/random/index.html
  3. https://www.redhat.com/en/blog/understanding-random-number-generators-and-their-limitations-linux
  4. Mersenne Twister in detail: https://www.cryptologie.net/article/331/how-does-the-mersennes-twister-work/
  5. https://www.wikiwand.com/en/Inverse_transform_sampling
  6. Tensorflow: https://www.tensorflow.org/guide/random_numbers
  7. Pytorch: https://pytorch.org/docs/stable/notes/randomness.html
