Getting Started
Learn how to use your math and stats knowledge to solve data science and machine learning problems most efficiently. This is a series and we will start with Random Numbers.

Data science blog series:
When I started dusting off my math and stats knowledge (read Linear Algebra, Calculus and Probability) to combine those with Machine Learning concepts for the eventual application in Finance, I struggled big time.
I struggled not with the math and stats concepts per se (I had some relevant background, and thankfully there are truckloads of online reference materials just a click away) but with how to use those concepts in Machine Learning (ML) and Deep Learning (DL) problems. I struggled to establish a clear logical link between the math concepts and their applications in ML and DL, and to find the purpose behind these concepts in that context.
Unfortunately, hours of googling didn’t help, as I could not find precise and concise material addressing this point. I had to learn it the hard way, and that’s when I thought of starting a blog series to dumb it down for fellow sufferers.
In this series, I will pick up a stats or math topic and simplify it to the most basic level so that people can use it in their ML and DL applications. My intention is to shed light not on math and stats theories, nor on ML and DL concepts, but on the linkages between the two. In other words, I will assume that you have a basic understanding of these concepts and I will not attempt to teach them (at least not in this blog series).
Let’s fasten our seat belts then, and start with the humble random number.
Randomness : Before we jump into Random Numbers, let’s understand randomness and its importance. No future event in our daily and professional lives is absolutely certain: meeting your friends next weekend, a call scheduled with your boss the day after tomorrow, plans to go jogging early tomorrow morning; nothing is certain.
And when it comes to business problems, such as the closing price of a stock next Friday, the credit rating of a bond at the end of next month, or the number of new customers acquired in the next three months, the outcomes are much more uncertain. This is because a larger number of factors affect them.
The greater the uncertainty, the greater the randomness. But we don’t have a choice: we have to predict the outcome of business events in order to make decisions.
What to do? We have to reduce the effect of randomness on the outcome of an event so that business decisions can be made with some certainty. How to reduce the effect of randomness? Simple. Incorporate randomness into the calculation of the outcome: simulate many possible outcomes and base the decision on how they are distributed rather than on a single guess. How to do that? In come Random Numbers.
Random Numbers (RN) : A couple of important points about RN. We rely on computers to generate RN for us. However, nothing that happens inside a computer is truly random: computer-generated RN come out of a deterministic routine and are therefore pseudo-random. So critical applications such as cryptography should not rely on a general-purpose pseudo-random number generator; they need a cryptographically secure source. Also, the better the statistical quality of the random number generation (RNG) process, the more reliable our calculations will be, so choose a strong RNG (Mersenne Twister is the most common and does a fairly good job for most regular business problems).
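To make the "deterministic routine" point concrete, here is a minimal sketch. The seed value 2024 is an arbitrary assumption, and the use of Python's secrets module is my suggestion for the cryptography case rather than something covered elsewhere in this post.

```python
import secrets
import numpy as np

# Two Mersenne Twister generators seeded with the same (arbitrary) value.
gen_a = np.random.Generator(np.random.MT19937(2024))
gen_b = np.random.Generator(np.random.MT19937(2024))

draws_a = gen_a.random(5)
draws_b = gen_b.random(5)

# Same seed and same algorithm: the two "random" streams are identical.
print(np.array_equal(draws_a, draws_b))

# For secrets such as tokens, draw from the operating system's entropy source.
token = secrets.token_hex(16)  # 16 random bytes shown as 32 hex characters
print(len(token))
```

Reproducibility is a feature when debugging a simulation, and exactly the property you do not want in a password or key generator.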
Random Number Distributions (RND) : So we need RN, but that’s not enough; we also need to know what type of RN we need, because each type is suited to a specific type of business problem. The type of RN is decided by the probability distribution (PD) it is drawn from (e.g. Uniform, Normal etc.). In this context, RND and PD mean the same thing. To understand the link between random numbers and PDs we need the definition of a random variable. A random variable is a variable that can take on certain values with certain probabilities. The collection of these probabilities is called the probability distribution of the random variable. A probability distribution specifies how the total probability (always 1) is distributed among the possible outcomes. For example, suppose you have calculated all the possible values of tomorrow’s Brent closing price with their corresponding probabilities and represented them by a random variable S. The different values of S together with their probabilities form the PD of S.
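A small sketch of the random variable S may help. The five prices and their probabilities below are made-up numbers for illustration, not market data.

```python
import numpy as np

# Hypothetical Brent closing prices (USD) and their probabilities.
prices = np.array([78.0, 79.0, 80.0, 81.0, 82.0])
probs = np.array([0.10, 0.20, 0.40, 0.20, 0.10])

# A valid probability distribution spreads a total probability of 1.
total_probability = probs.sum()

# Draw random numbers that follow this distribution.
rng = np.random.default_rng(7)  # arbitrary seed
samples = rng.choice(prices, size=10_000, p=probs)

# Probability-weighted average of the possible values.
expected_price = float(np.dot(prices, probs))
print(total_probability, expected_price, samples.mean())
```

With enough draws, the sample average settles close to the probability-weighted average of the possible values.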
Each PD follows a certain rule and has a set of parameters (e.g. mean, variance, skewness, kurtosis, etc.) which dictate the generation of RN.
Now the final question: how do we know which RND we need for a business problem? For that, understand the business problem thoroughly and select the RND which characterizes the problem best. An RND characterizes a problem best when the conditions of the distribution match those of the problem. For example, suppose you think the opening price of a stock tomorrow is going to be between 90% and 110% of today’s closing price, with every value in between equally likely. In this case, you can draw a set of 10,000 Uniform RN with lower and upper bounds of 0.9 and 1.1 respectively, multiply today’s closing price by each RN and then average the results to arrive at tomorrow’s opening price (stock price prediction is not quite done this way, but I hope you get the idea).
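The uniform example can be sketched in a few lines. The closing price of 100 is a hypothetical value, and the multipliers are drawn between 0.9 and 1.1 so that each scenario lands between 90% and 110% of the close.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
close_today = 100.0             # hypothetical closing price

# 10,000 equally likely multipliers between 90% and 110%.
multipliers = rng.uniform(0.9, 1.1, size=10_000)
scenarios = close_today * multipliers

# Average the scenarios to get a single estimate of tomorrow's open.
estimate = scenarios.mean()
print(scenarios.min(), scenarios.max(), estimate)
```

Because the multipliers are symmetric around 1, the average ends up close to today's close; the spread of the scenarios is the more informative output.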
On the other hand, if you think the monthly movement of a stock index follows a normal distribution with a specific mean and SD, then you need to draw a set of Normal RN with that mean and SD and apply them in the calculation of the stock index price. Still differently, if you hold a portfolio of risky bonds with different maturities and you think that on average 3 bonds default in a year, then you need random numbers drawn from a Poisson distribution with parameter λ = 3 to model the actual number of defaults in your portfolio.
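Both cases can be sketched the same way. The monthly mean of 1%, the SD of 4% and the index level of 1,000 are assumptions made up for this illustration; only λ = 3 comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Stock index: assumed monthly return with mean 1% and SD 4% (hypothetical).
monthly_returns = rng.normal(loc=0.01, scale=0.04, size=10_000)
index_level = 1_000.0           # hypothetical current index level
simulated_levels = index_level * (1.0 + monthly_returns)

# Bond portfolio: on average 3 defaults per year.
defaults_per_year = rng.poisson(lam=3, size=10_000)

# Share of simulated years with more than five defaults.
prob_more_than_five = (defaults_per_year > 5).mean()
print(simulated_levels.mean(), defaults_per_year.mean(), prob_more_than_five)
```

The last line is the kind of question the simulation is meant to answer: how often would a year be much worse than the average of three defaults?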
For any data analyst or practitioner worth her salt, knowing the concept of PD is non-negotiable. It provides the foundation for analytics and inferential statistics. Of the hundreds of distributions, there are just a few we need to know more closely for our daily usage. I have chosen five of them here: Uniform, Binomial, Normal, Poisson and Exponential. I plan to have a post talking about their applications and also use cases of these distributions in quant finance in future.
Anyway, that’s enough theory; now let’s get our hands dirty with the practical execution of these concepts in Python.
Below is a ready reckoner of the PDs we referred to earlier and the associated random number generating functions in NumPy, with examples and usage.
Uniform Distribution
A random variable is said to follow a uniform distribution when every value it can take, within a given range, is equally likely (i.e. has the same probability). The uniform distribution has the largest variety of random number generating functions in NumPy. Let’s take a look at them one by one.
np.random.randint(40, 100, 10)
array([64, 82, 43, 75, 48, 40, 90, 78, 81, 76])
— returns 10 random integers from 40 to 99, drawn from the "discrete uniform" distribution over the half-open interval [40, 100). Half-open means the high boundary value is excluded from the random selection but the low boundary value is included. Note: the low value can be negative.
np.random.randint(10, size = 50)
array([3, 7, 0, 8, 3, 4, 3, 9, 5, 7, 9, 3, 3, 2, 9, 7, 8, 3, 4, 3, 0, 3,2, 8, 8, 0, 6, 2, 3, 0, 2, 9, 8, 8, 3, 4, 0, 9, 0, 7, 5, 9, 7, 3, 4, 4, 5, 2, 3, 0])
— If you pass just one positional argument to np.random.randint, it is treated as the high value and the low value defaults to zero. For example, the above command returns 50 integers from the "discrete uniform" distribution with the high value as 10 (excluded) and the low value as 0 (included).
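The half-open behaviour is easy to check empirically. This sketch uses a seeded np.random.RandomState instance, which exposes the same randint function as np.random; the seed and the sample size of 100,000 are arbitrary choices.

```python
import numpy as np

rs = np.random.RandomState(0)  # legacy interface with a fixed, arbitrary seed

# Two bounds: low is included, high is excluded.
two_bounds = rs.randint(40, 100, 100_000)

# One bound: treated as high, low defaults to 0.
one_bound = rs.randint(10, size=100_000)

print(two_bounds.min(), two_bounds.max())
print(one_bound.min(), one_bound.max())
```

With this many draws, the smallest and largest values observed should sit at the included low bound and at one below the excluded high bound.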
np.random.random_integers(10, size = 50)
array([5, 2, 9, 10, 1, 8, 9, 4, 2, 6, 6, 4, 5, 4, 7, 9, 9, 6, 7, 8, 10, 1, 1, 5, 6, 7, 1, 6, 10, 1, 8, 8, 4, 4, 3, 10, 1, 8, 4, 3, 7, 4, 7, 8, 1, 2, 2, 7, 1, 8])
— returns integers from the "discrete uniform" distribution, with a couple of minor differences from np.random.randint. The default low for np.random.randint is 0, whereas for np.random.random_integers it’s 1. Also, the former samples from a half-open interval (high excluded) while the latter samples from a closed interval (high included).
Note: np.random.random_integers has been deprecated since NumPy 1.11; prefer np.random.randint.
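Since np.random.random_integers is deprecated, the same closed range of 1 to 10 can be obtained in other ways. The sketch below shows two options I would reach for; the seeds are arbitrary.

```python
import numpy as np

# Option 1: legacy randint with the high bound raised by one (11 is excluded).
legacy = np.random.RandomState(0)
closed_via_randint = legacy.randint(1, 11, size=50)

# Option 2: the newer Generator, asking explicitly for an inclusive endpoint.
rng = np.random.default_rng(0)
closed_via_integers = rng.integers(1, 10, size=50, endpoint=True)

print(closed_via_randint.min(), closed_via_randint.max())
print(closed_via_integers.min(), closed_via_integers.max())
```

Both calls return 50 integers that can take any value from 1 to 10 inclusive.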
np.random.random_sample((3,4))
array([[0.61879859, 0.61965074, 0.25047645, 0.9547143 ],
[0.62274851, 0.25153697, 0.79753309, 0.78389591],
[0.30247623, 0.00885101, 0.87795343, 0.48111766]])
— returns a 3*4 array of random floats from a "continuous uniform" distribution over the half-open interval [0, 1). np.random.random_sample has two aliases: np.random.random and np.random.ranf
np.random.rand(2, 3)
array([[0.40629507, 0.59574751, 0.04639712],
[0.65445587, 0.96331397, 0.45717752]])
np.random.random_sample((2,3))
array([[0.39825425, 0.70531165, 0.33423643],
[0.04037576, 0.34760155, 0.98907074]])
— [np.random.rand](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.rand.html) also returns random floats from a "continuous uniform" distribution over the half-open interval [0, 1). The only difference is in how the arguments are handled. With np.random.rand, the length of each dimension of the output array is a separate argument. With np.random.random_sample, the shape argument is a single tuple.
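One way to convince yourself that the two functions differ only in their argument style is to seed two legacy generators identically and compare the results. The seed 123 is arbitrary.

```python
import numpy as np

rs1 = np.random.RandomState(123)
rs2 = np.random.RandomState(123)

via_rand = rs1.rand(2, 3)                      # dimensions as separate arguments
via_random_sample = rs2.random_sample((2, 3))  # shape as a single tuple

# Same seed, same underlying stream: the arrays match element by element.
print(np.array_equal(via_rand, via_random_sample))
```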
np.random.uniform(1, 10, size = (3,4))
array([[8.84719423, 7.7877443 , 6.69353674, 1.47646953],
[6.0509388 , 9.49876916, 3.5175086 , 5.32753145],
[8.79560377, 3.93732568, 3.67322864, 9.94109316]])
— returns a 3*4 array of random floats from a "continuous uniform" distribution over the half-open interval [1, 10). The earlier functions always return floats between 0 and 1, whereas np.random.uniform can return floats in any user-defined range. The low value can be negative.
Note: For all the above functions, if no size (or shape) argument is provided then a single value is returned; np.random.randint and np.random.random_integers still require their bound argument.
Normal Distribution
A random variable has a normal distribution when the bulk of its values lie around the mean and fewer values lie towards the tails. The farther a value is from the mean, the less likely it is, so values near the average are the most likely to occur (i.e. have higher probability). Let’s see a few normal random number generating functions below:
np.random.randn(3,4)
array([[ 0.05056171, 0.49995133, -0.99590893, 0.69359851],
[-0.41830152, -1.58457724, -0.64770677, 0.59857517],
[ 0.33225003, -1.14747663, 0.61866969, -0.08798693]])
— returns a 3*4 array of random float with mean 0 and standard deviation 1 from a standard normal distribution
np.random.standard_normal((3,4))
array([[ 0.05056171, 0.49995133, -0.99590893, 0.69359851],
[-0.41830152, -1.58457724, -0.64770677, 0.59857517],
[ 0.33225003, -1.14747663, 0.61866969, -0.08798693]])
— same as np.random.randn with the only difference being in the way the arguments are provided. This function takes a tuple, as opposed to integers, to specify the size of the output
np.random.normal(1,2, size=(3,4))
array([[ 1.88245497, 0.3382597 , 5.86154237, 0.49581574],
[ 1.21921968, 4.16496223, -0.81846481, -0.18327332],
[ 1.37520645, 0.34026008, -1.38552922, 0.59024698]])
— generates random numbers with a user-defined mean and standard deviation. The above example returns a 3*4 array of random floats from a normal distribution with mean 1 and standard deviation 2. You can also name the loc and scale parameters explicitly to make the mean and standard deviation clear, as shown below
np.random.normal(loc = 1, scale = 2, size=(3,4))
array([[ 1.88245497, 0.3382597 , 5.86154237, 0.49581574],
[ 1.21921968, 4.16496223, -0.81846481, -0.18327332],
[ 1.37520645, 0.34026008, -1.38552922, 0.59024698]])
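The relationship between these functions can be sketched as follows: randn and standard_normal read the same stream, and a normal draw with mean 1 and standard deviation 2 behaves like a standard normal draw that has been scaled by 2 and shifted by 1. The seeds and the sample size are arbitrary.

```python
import numpy as np

# Same seed, two call styles: separate integers versus a shape tuple.
rs1 = np.random.RandomState(99)
rs2 = np.random.RandomState(99)
same_stream = np.array_equal(rs1.randn(3, 4), rs2.standard_normal((3, 4)))
print(same_stream)

# Scale and shift standard normal draws: mean becomes 1, SD becomes 2.
z = np.random.RandomState(0).standard_normal(100_000)
shifted = 1 + 2 * z
print(shifted.mean(), shifted.std())
```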
Normal Discrete Distribution
All the above examples are of functions generating random numbers from the continuous Normal distribution. There is no built-in function to generate a discrete version (i.e. normally distributed random integers). Below are a few alternative ways to achieve that; each generates a 10*5 array of whole numbers by drawing from a normal distribution with mean 2 and standard deviation 3 and then converting the draws.
import numpy as np
# There is no direct way to generate random integers for the normal distribution; following are a few workarounds.
# 'a' and 'd' drop the fractional part (truncate toward zero); 'b' and 'c' round to the nearest integer.
# Only 'd' has an integer dtype, so it prints without trailing decimals.
a = np.trunc(np.random.normal(2, 3, size=(10,5)))
b = np.rint(np.random.normal(2, 3, size=(10,5)))
c = np.round(np.random.normal(2, 3, size=(10,5)))
d = np.random.normal(2, 3, size=(10,5)).astype(int)
# to replace -0. with 0. (negative zero compares equal to zero)
a = np.where(a == 0, 0, a)
b = np.where(b == 0, 0, b)
c = np.where(c == 0, 0, c)
print(a); print('\n')
print(b); print('\n')
print(c); print('\n')
print(d)
[[ 6. 0. 0. -1. 4.]
[-4. 7. 0. 2. 1.]
[ 6. -4. 1. 0. 5.]
[-1. 1. 0. 2. 3.]
[-1. 5. 4. 3. 4.]
[ 0. 1. 0. 1. 3.]
[ 0. 0. 0. 0. 0.]
[ 1. -1. 2. 6. 4.]
[ 1. 0. 0. 7. 2.]
[ 0. 2. 8. 2. 3.]]
[[ 3. 1. -1. 1. 1.]
[ 4. 5. 5. 3. 5.]
[ 0. 6. 4. 1. 3.]
[ 2. 5. 7. 9. -2.]
[-2. 0. 2. 5. 3.]
[-4. 1. 4. 3. 4.]
[ 1. 1. 3. 3. 3.]
[ 2. 0. 3. 2. 5.]
[ 6. 3. 1. 0. 3.]
[ 2. 1. 2. 0. 4.]]
[[ 1. 6. 3. 4. -1.]
[ 3. 4. -1. 1. 2.]
[-2. 3. 5. -1. 3.]
[-2. 2. -3. 5. 3.]
[ 2. 0. 6. 8. -4.]
[ 6. 7. 3. -2. 5.]
[ 1. 0. -2. 4. 4.]
[ 0. 4. -1. 4. 2.]
[ 1. 2. 5. 4. 4.]
[ 2. 2. 4. 3. 4.]]
[[ 1 -5 5 8 3]
[ 1 1 1 2 -1]
[ 0 0 2 1 3]
[ 1 4 2 8 -3]
[ 0 4 9 1 2]
[ 1 5 1 4 1]
[-1 2 3 5 1]
[ 0 3 2 2 1]
[ 5 3 7 5 3]
[-2 3 3 4 5]]
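One thing worth checking before picking a workaround is how the conversion affects the mean. The sketch below compares rounding with truncation on a large sample; the seed and the sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)  # arbitrary seed
draws = rng.normal(2, 3, size=200_000)

rounded = np.rint(draws).astype(int)  # nearest integer
truncated = draws.astype(int)         # drops the fractional part (toward zero)

mean_rounded = rounded.mean()
mean_truncated = truncated.mean()
print(mean_rounded, mean_truncated)
```

Truncation pulls every value toward zero, so with a positive mean the truncated sample averages noticeably below 2, while the rounded sample stays close to it. If preserving the mean matters, rounding followed by astype(int) is arguably the safer combination.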
Binomial Distribution
A random variable has a binomial distribution when it counts the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes and the probability of success is the same for every trial. In other words, a binomial random variable shows how frequently a particular event occurs in a fixed number of trials.
np.random.binomial(22, .5, 100)
array([ 9, 11, 9, 12, 12, 15, 10, 10, 10, 12, 10, 10, 11, 8, 10, 12, 11, 9, 8, 9, 12, 14, 11, 9, 8, 8, 12, 13, 11, 8, 8, 10, 12, 11, 11, 13, 10, 9, 11, 10, 10, 12, 7, 11, 11, 12, 10, 8, 8, 11, 11, 11, 11, 13, 11, 15, 9, 13, 10, 14, 16, 12, 6, 9, 10, 16, 6, 11, 12, 7, 11, 13, 8, 11, 11, 12, 12, 11, 12, 8, 13, 12, 15, 8, 12, 10, 12, 9, 10, 13, 13, 12, 13, 10, 9, 9, 4, 12, 12, 7])
— returns 100 random integers from a binomial distribution with 22 trials and a success probability of 50% for each trial. Think of it this way: assume that, based on historical data, the probability of the Apple stock price going up on any given day is 50%, that there are 22 trading days in a month (on average), and that you want to find the probability of the price going up on various numbers of days in a month. So you run this 22-day experiment 100 times and get the result shown above.
To make more sense of these numbers, you need to plot a histogram.
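import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True draws the histogram plus density curve
sns.histplot(x=np.random.binomial(n=22, p=0.5, size=100), stat='probability', discrete=True, kde=True)
plt.xlabel('Number of days Apple price goes up in a month')
plt.show()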

From the histogram above, you can read off the estimated probability of the Apple stock price going up on various numbers of days in a month. In the sample plotted here, the probability of the price going up on 11 days in a month is about 25%, on 14 days about 10%, on 4 days about 5%, and so on; with only 100 random draws, these figures change from run to run. It becomes even clearer if you plot a countplot as below:
sns.countplot(x=np.random.binomial(n=22, p=0.5, size=100))
plt.xlabel('Number of days Apple price goes up in a month')
plt.show()

The countplot above shows that, out of the 100 experiments you conducted, in 20 experiments the Apple price went up on 12 days in a month, in 10 experiments it went up on 14 days, and so on.
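The same information can be tabulated without a plot and compared with the exact binomial probabilities. In this sketch the number of simulated months is raised to 100,000 (my assumption, to reduce sampling noise) and the seed is arbitrary.

```python
from math import comb
import numpy as np

n_days, p_up, n_months = 22, 0.5, 100_000
rng = np.random.default_rng(42)  # arbitrary seed
up_days = rng.binomial(n=n_days, p=p_up, size=n_months)

# Relative frequency of each possible count 0..22 in the simulation.
empirical = np.bincount(up_days, minlength=n_days + 1) / n_months

# Exact binomial probabilities: C(n, k) * p^k * (1 - p)^(n - k).
exact = np.array([comb(n_days, k) * p_up**k * (1 - p_up)**(n_days - k)
                  for k in range(n_days + 1)])

for k in (4, 11, 14):
    print(k, round(float(empirical[k]), 4), round(float(exact[k]), 4))
```

With a large number of experiments the simulated frequencies line up closely with the exact values, which is a useful sanity check on readings taken from a 100-draw plot.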
A special case of the Binomial is the Bernoulli distribution, where the number of trials is always 1. Below is an example of a Bernoulli random number generator:
np.random.binomial(1, 0.6, size=100)
array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0,1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1,
1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0,
0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0])
Poisson Distribution
A random variable has a Poisson distribution when it counts the number of times an event occurs within a given interval of space or time, assuming the events occur independently and at a constant average rate.
np.random.poisson(5, 100)
array([9, 3, 2, 5, 3, 8, 9, 5, 8, 7, 5, 4, 2, 6, 5, 2, 4, 4, 5, 5, 6, 6, 6, 2, 8, 6, 4, 4, 5, 2, 6, 5, 2, 4, 5, 5, 2, 7, 7, 4, 5, 6, 7, 7, 4, 5, 5, 8, 5, 6, 3, 4, 4, 4, 6, 3, 9, 4, 9, 7, 4, 8, 7, 3, 2, 3,
5, 3, 4, 9, 4, 4, 3, 3, 2, 6, 5, 8, 3, 3, 4, 5, 7, 4, 4, 8, 5, 8,
3, 5, 3, 3, 3, 4, 7, 8, 6, 1, 1, 2])
— returns 100 random integers from a Poisson distribution with an expected rate of 5. Think of it this way: assume past data shows that, on average, five bonds from the corporate bond universe default in a month. You want to find the probability of various numbers of bonds defaulting in a month. So you ran this one-month experiment 100 times and got the result shown above.
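A quick numerical check of the Poisson draws: the mean and the variance of a Poisson distribution both equal the rate, and the probability of exactly k events is exp(-λ) λ^k / k!. The sample size of 100,000 and the seed are arbitrary assumptions.

```python
from math import exp, factorial
import numpy as np

lam = 5  # average number of defaults per month
rng = np.random.default_rng(3)  # arbitrary seed
defaults = rng.poisson(lam=lam, size=100_000)

# For a Poisson distribution, mean and variance are both equal to lam.
sample_mean = defaults.mean()
sample_var = defaults.var()

# Probability of exactly five defaults: formula versus simulation.
exact_p5 = exp(-lam) * lam**5 / factorial(5)
empirical_p5 = (defaults == 5).mean()
print(sample_mean, sample_var, exact_p5, empirical_p5)
```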
As with the binomial, to make more sense of these numbers you can plot a histogram and a countplot. The histogram shows the probability of various numbers of bonds defaulting in a month, while the countplot shows how many of the 100 experiments produced each number of defaults.
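defaults = np.random.poisson(5, 100)
sns.histplot(x=defaults, stat='probability', discrete=True, kde=True)  # histogram with density curve
plt.xlabel('Number of bond defaults in a month')
plt.show()
sns.countplot(x=defaults)  # count of each outcome across the 100 experiments
plt.xlabel('Number of bond defaults in a month')
plt.show()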


Exponential Distribution
The Exponential distribution is inversely related to the Poisson distribution. A random variable follows an Exponential distribution when it describes the time until a specific event occurs. In other words, the waiting time between consecutive events of a Poisson process follows an Exponential distribution, whose mean is the reciprocal of the Poisson rate.
np.random.exponential(scale=2.4, size=100)
array([ 9.06140756, 0.95891383, 4.90338297, 2.20290367, 2.56006299, 0.03202048, 1.88378882, 6.63501052, 0.42636625,
.
.
.
0.44203044, 3.54720377, 1.14886891, 11.75031166, 1.59421014,
1.40063617, 3.2650319 , 6.32941804, 1.91424698, 0.95795208])
— returns 100 random floats from an Exponential distribution with an average waiting time of 2.4 units of time. The scale parameter is the mean waiting time, i.e. the reciprocal of the Poisson rate: if five bonds default in a month, the average time between two defaults is 1/5 = 0.2 months, whereas a scale of 2.4 corresponds to a rate of 1/2.4, or about 0.42 defaults per unit of time (for instance, five defaults a year gives 12/5 = 2.4 months between defaults). You want to find the probability of various waiting times between two defaults. So you drew 100 waiting times and got the result shown above (to save space, I contracted the output).
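The link between the two distributions can be sketched by simulating waiting times and counting how many events fall in each unit of time. The rate of 5 per unit of time mirrors the Poisson example; the seed and the number of draws are arbitrary assumptions.

```python
import numpy as np

rate = 5.0          # events per unit of time (Poisson rate)
scale = 1.0 / rate  # mean waiting time between events

rng = np.random.default_rng(11)  # arbitrary seed
gaps = rng.exponential(scale=scale, size=500_000)

# Cumulative waiting times give the moment each event occurs.
arrival_times = np.cumsum(gaps)

# Count the events falling in each complete unit-length interval.
horizon = int(np.floor(arrival_times[-1]))
in_horizon = arrival_times[arrival_times < horizon]
counts = np.bincount(in_horizon.astype(int), minlength=horizon)

print(gaps.mean(), counts.mean(), counts.var())
```

The average gap comes out near 1/5 and the per-interval counts have a mean and variance near 5, which is what a Poisson distribution with rate 5 would give.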
As before, plotting the distribution makes it easier to read. The density plot below shows that short waiting times, roughly 0 to 2 units of time before the next default, have the highest probability density, which then decays as the waiting time grows. Note that the exact shape depends on the chosen scale: here a mean waiting time of 2.4 units between two defaults or, equivalently, a rate of 1/2.4 defaults per unit of time.
sns.kdeplot(x=np.random.exponential(scale=2.4, size=100))  # density curve only (the old hist=False, kde=True)
plt.xlabel('Waiting time between two defaults')
plt.show()

Note : By their very designs Binomial and Poisson are discrete distributions, thus these functions will always return integers whereas Exponential is a continuous distribution (again by design) and will always return floats.
Endnote
From NumPy 1.17 onwards, a new random number generator is available to draw sample values from various distributions. In the new approach, the Generator is a container class which needs to be instantiated; the instance is then used to generate random values from any distribution you prefer. For example:
new_rand = np.random.default_rng()
new_rand.uniform(0,1,5) # draw 5 random numbers from the Uniform distribution over [0, 1)
For more details on the new generator see this link https://numpy.org/doc/stable/reference/random/
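For reference, here is a sketch of how the functions covered in this post map onto a Generator instance. The seed 2020 is an arbitrary choice, and the pairing in the comments is my reading of the NumPy documentation rather than an official migration table.

```python
import numpy as np

rng = np.random.default_rng(2020)  # seeded for reproducibility (arbitrary seed)

ints = rng.integers(40, 100, size=10)         # like np.random.randint
floats = rng.random((3, 4))                   # like np.random.random_sample
scaled = rng.uniform(1, 10, size=(3, 4))      # like np.random.uniform
z = rng.standard_normal((3, 4))               # like np.random.standard_normal
up_days = rng.binomial(n=22, p=0.5, size=100)
defaults = rng.poisson(lam=5, size=100)
waits = rng.exponential(scale=2.4, size=100)

print(ints)
print(floats.shape, scaled.shape, z.shape)
```

Passing the generator around as an object, instead of relying on the global np.random state, also makes it easier to keep separate simulations reproducible.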
Conclusion
So this was my first post in the blog series, in which we explore the linkages between math and the practical application of Data Science, and how our math and stats knowledge can help us solve data science problems easily and effectively. I hope you found this brief overview of random numbers, their related concepts and their usage in Data Science helpful.
Additional resources and references :
For detailed information about the random variable and random number generation:
https://docs.scipy.org/doc/numpy-1.16.0/reference/routines.random.html
For a visual representation of Univariate Distribution Relationships:
http://www.math.wm.edu/~leemis/chart/UDR/UDR.html
In case you want to know a bit more about the Mersenne Twister process:
https://www.sicara.ai/blog/2019-01-28-how-computer-generate-random-numbers
A good article on how to generate random numbers in Python:
https://machinelearningmastery.com/how-to-generate-random-numbers-in-python/
Thanks for reading! Keep exploring, for any query or suggestion please feel free to drop in a line to me at: