
Random Numbers In NumPy

How to generate random numbers in Python using NumPy's updated method

Photo by Black ice from Pexels

In common understanding, "1 2 3 4 5" is not as random as "3 5 2 1 4" and certainly not as random as "47 88 1 32 41" but "we can’t say authoritatively that the first sequence is not random … it could have been generated by chance." – Wikipedia

If you search online for how to generate random numbers with NumPy in Python, you will most likely see this kind of code:
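The original snippet is not reproduced here, so the following is only a representative sketch of the legacy style, assuming the commonly seen `np.random.seed`, `np.random.rand`, `np.random.randint`, and `np.random.randn` calls. The seed value 42 is an arbitrary choice.

```python
import numpy as np

# Legacy interface: one global generator, seeded through np.random.seed
np.random.seed(42)

floats = np.random.rand(3)                # three floats in [0.0, 1.0)
ints = np.random.randint(0, 10, size=5)   # five integers in [0, 10)
normals = np.random.randn(3)              # three standard-normal samples

print(floats)
print(ints)
print(normals)
```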

Some time ago, NumPy updated its method of generating random numbers, but search results are still littered with outdated code like the above. So I decided to write a short blog post explaining the updated method. The outdated method (called legacy RandomState in the NumPy documentation) still works, but it is 2–10 times slower than the newer method.

So, how do you generate random numbers now?

Generating Floats and Integers

You can copy-paste the code and run it directly in a Jupyter notebook or a Python interpreter (I have Python 3.9.2 and NumPy 1.20.1 as the defaults on my Arch Linux installation).
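A minimal sketch of the updated approach, assuming you create a generator with `np.random.default_rng` and then call its `random` and `integers` methods. The variable names and the seed value 42 are illustrative choices, not taken from the original post.

```python
import numpy as np

# default_rng() returns a Generator object; the seed is optional
rng = np.random.default_rng(seed=42)

one_float = rng.random()             # a single float in [0.0, 1.0)
float_array = rng.random(5)          # five floats in [0.0, 1.0)
float_matrix = rng.random((2, 3))    # a 2x3 array of floats

one_int = rng.integers(10)               # a single integer in [0, 10)
int_array = rng.integers(1, 7, size=5)   # five integers in [1, 7)

print(one_float)
print(float_array)
print(float_matrix)
print(one_int)
print(int_array)
```

Because the generator is an object rather than hidden global state, you can pass `rng` around explicitly or create several independent generators in the same program.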

As usual, the seed=(some number) option can be used to create reproducible random numbers, and it works just fine. The first two screenshots below were taken at the same time in two different tabs of my terminal, while the third was taken months later with a different Python version. One is from the default Python interpreter and the others are from the IPython interpreter. Notice how even the second run of the random method call reproduces exactly the same numbers:

Image by Author
Image by Author
Image by Author

You can see Generator(PCG64) in the 2nd screenshot. We will come to that in a while. First, let’s explore more ways to generate random floats and integers:
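A few more variations, offered as a sketch rather than the original listing. It assumes the `Generator` methods `uniform`, `random` (with its `dtype` and `out` parameters), and `integers` (with `endpoint`). The seed and array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=123)

# Floats from an arbitrary half-open interval [low, high)
uniform_floats = rng.uniform(low=-1.0, high=1.0, size=4)

# float32 output instead of the default float64
small_floats = rng.random(4, dtype=np.float32)

# endpoint=True makes the upper bound inclusive: dice rolls from 1 to 6
dice = rng.integers(1, 6, size=10, endpoint=True)

# out= fills an existing array in place instead of allocating a new one
buffer = np.zeros(5)
rng.random(out=buffer)

print(uniform_floats)
print(small_floats)
print(dice)
print(buffer)
```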

The optional "out" argument is not available in every random-number method, only in a select few. See link [4] at the end of this post.

Here is a screenshot from my Jupyter Lab showing a comparison of both methods in terms of speed:

Image by Author

Benchmark results you find on the internet generally depend on context. Still, this updated method is faster.
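If you want to check the speed on your own machine, a rough timing sketch could look like the following. It assumes the standard-library `timeit` module and compares standard-normal draws. The sample size and repeat count are arbitrary, and the numbers will vary with hardware and NumPy version.

```python
import timeit

import numpy as np

n = 1_000_000   # samples per call (arbitrary)
repeats = 10    # timed calls per method (arbitrary)

rng = np.random.default_rng()

# Legacy global-state call versus the Generator method
legacy_time = timeit.timeit(lambda: np.random.standard_normal(n), number=repeats)
new_time = timeit.timeit(lambda: rng.standard_normal(n), number=repeats)

print(f"legacy RandomState: {legacy_time:.3f} s")
print(f"new Generator:      {new_time:.3f} s")
```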

Generating Normal Distribution

Any discussion of random numbers is incomplete without generating a normal distribution. This is a Google data science interview question taken from Interview Query (Jay Feng):

Write a function to generate N samples from a normal distribution and plot the histogram

The task has two parts. First, we will generate a normal distribution using the updated NumPy method, and then we will plot a histogram using Matplotlib. One piece is missing: the question tells you to draw N samples from a normal distribution but does not say what mean and standard deviation the interviewer wants. So we can think of two approaches:

  1. Write a function to draw N samples from a standard normal distribution (a standard normal distribution has a mean of 0 and a standard deviation of 1)
  2. Write a function that takes three arguments: mean, standard deviation, and N. It generates N samples from a normal distribution with that mean and standard deviation.
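Both approaches can be sketched as follows. The name `using_standard_normal` and the optional `seed` parameters are assumptions made for illustration. `using_normal` follows the name used later in this post, but its exact signature here is a guess.

```python
import numpy as np


def using_standard_normal(n, seed=None):
    """Draw n samples from a standard normal distribution (mean 0, std 1)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n)


def using_normal(mean, std, n, seed=None):
    """Draw n samples from a normal distribution with the given mean and std."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=mean, scale=std, size=n)


print(using_standard_normal(5, seed=42))
print(using_normal(100, 15, 5, seed=42))
```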

And again, if you use the same seed=(some number), you can reproduce the distribution:

Image by Author

Now let’s plot a histogram using matplotlib:
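A minimal plotting sketch, assuming 10,000 standard-normal samples and 50 bins. Both numbers, the seed, and the axis labels are illustrative choices rather than the original listing.

```python
import matplotlib.pyplot as plt
import numpy as np

# Draw the samples with the updated NumPy method
rng = np.random.default_rng(seed=7)
samples = rng.standard_normal(10_000)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(samples, bins=50)
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of standard-normal samples")

# In a script, call plt.show() here; Jupyter displays the figure automatically
```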

and this is how it will look:

Image by Author

We can plot the same kind of histogram for a normal distribution with a chosen mean and standard deviation. You just need to generate the samples with using_normal() instead.

Matplotlib Histogram Example

Matplotlib.org has a histogram example in its gallery section. To use the updated method, you just need to change 3 lines of code:
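The listing below is a sketch adapted from the gallery's histogram example as I recall it, so details may differ from the current gallery version. Which three lines the original post changed is an assumption. Here they are the import, the seeding, and the sampling call, each marked with a comment.

```python
import matplotlib.pyplot as plt
import numpy as np
from numpy.random import default_rng  # changed: import the new constructor

rng = default_rng(19680801)  # changed: replaces np.random.seed(19680801)

# Example data
mu = 100     # mean of the distribution
sigma = 15   # standard deviation of the distribution
x = mu + sigma * rng.standard_normal(437)  # changed: replaces np.random.randn(437)

num_bins = 50

fig, ax = plt.subplots()

# The histogram of the data, normalized to a probability density
n, bins, patches = ax.hist(x, num_bins, density=True)

# Overlay the theoretical normal density as a dashed line
y = ((1 / (np.sqrt(2 * np.pi) * sigma)) *
     np.exp(-0.5 * (1 / sigma * (bins - mu)) ** 2))
ax.plot(bins, y, "--")
ax.set_xlabel("Smarts")
ax.set_ylabel("Probability density")
ax.set_title(r"Histogram of IQ: $\mu=100$, $\sigma=15$")

fig.tight_layout()
# In a script, call plt.show() here; Jupyter displays the figure automatically
```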

and this is the output:

Image by Author

As you can see, the histogram shows a normal distribution. If you compare it with the original output and graph in the Matplotlib gallery, both the output values and the graph have changed even though we used the same seed number. Why?

This happened because the older and newer methods use different algorithms to generate random numbers. The old method used (and can still use) the Mersenne Twister pseudo-random number generator (MT19937). The updated method uses a permuted congruential generator (PCG-64). Without going into technical details, the primary difference is that PCG-64 has better (think ‘far better’) statistical properties than the Mersenne Twister. It is also faster and more space-efficient.
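To see the two bit generators side by side, here is a small sketch assuming the `Generator`, `PCG64`, and `MT19937` classes from `numpy.random`. The seed value 2021 is arbitrary. At the time of writing, `default_rng` wraps PCG64, though NumPy does not promise that default forever.

```python
import numpy as np
from numpy.random import MT19937, PCG64, Generator

seed = 2021  # arbitrary seed shared by all three generators

default_gen = np.random.default_rng(seed)  # PCG64 under the hood
pcg_gen = Generator(PCG64(seed))           # the same bit generator, built explicitly
mt_gen = Generator(MT19937(seed))          # Mersenne Twister behind the new interface

print(default_gen)  # Generator(PCG64)
print(mt_gen)       # Generator(MT19937)

default_sample = default_gen.random(3)
pcg_sample = pcg_gen.random(3)
mt_sample = mt_gen.random(3)

# Same seed and same bit generator give the same numbers;
# a different bit generator gives a different stream
print(default_sample)
print(pcg_sample)
print(mt_sample)
```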

That was all about the newer method of generating random numbers in NumPy. If you want the full technical details, check these:

[1] https://numpy.org/doc/stable/reference/random/bit_generators/pcg64.html

[2] https://numpy.org/doc/stable/reference/random/bit_generators/mt19937.html

[3] https://www.pcg-random.org/

[4] https://numpy.org/doc/stable/reference/random/new-or-different.html

[5] https://en.wikipedia.org/wiki/Permuted_congruential_generator

[6] https://en.wikipedia.org/wiki/Mersenne_Twister

