Why Probability Distribution Is a Must in DS/ML

A comprehensive guide

vikashraj luhaniwal
Towards Data Science


As the name suggests, a probability distribution describes how the total probability is distributed across all possible outcomes of a random variable.
For example, assume a bank provides four kinds of debit cards (Classic, Silver, Gold, Platinum) to its customers. Each of these debit cards has specific values for different parameters like the maximum transaction amount, reward points, annual service cost, maximum purchase amount, etc.

The bank provides debit cards based on customer features like age, educational status, profession, salary, family size, place of residence, etc.

Now assume a software professional opens an account with the bank; what kind of debit card should he be offered? After analyzing all the customer features, the bank comes up with a probability table as shown below.

Distribution of probabilities across debit card types

The above table represents the probability distribution of debit cards, where the total probability (1.0) is distributed across all four types of debit cards with their corresponding probability values.

Probability distributions are essential in -

  • Data Analysis
  • Decision making

This blog emphasizes the need for probability distributions in the above two contexts, the common types of probability distributions, and different tests for checking normality.

Let us discuss this with the Australian athletes data set available on Kaggle. This sample data contains various physical attributes of the athletes: 202 observations of 13 different features like height, weight, body mass index, sex, red blood cell count, white blood cell count, etc.

Data Source: https://www.kaggle.com/vikashrajluhaniwal/australian-athletes-data-set

Now, let’s try to build a simple classification model to classify the sex of an athlete using one variable at a time. For this purpose, the probability density function (PDF) is very helpful for assessing the importance of a continuous variable.

Additional note: this article contains only the final visualization output in interactive mode for the different plots. To access the Python code, follow the Kaggle kernel here (https://www.kaggle.com/vikashrajluhaniwal/5-must-know-probability-distributions).

a. Analyzing rcc

Since the PDFs of rcc for the two gender categories slightly overlap, it is not possible to classify every point correctly with a single condition. A simple model can be built by using the intersection point of the two PDFs as a threshold:

if rcc <= 4.68:
    sex = "female"
else:
    sex = "male"

From the PDF curves we can observe that below 4.68 (the value at the intersection) more points belong to the female category than to the male category; similarly, above 4.68 there are more male points. But this model misclassifies the female points that lie above 4.68 and, likewise, the male points that lie below 4.68.

Similarly, we can use other variables to build simple classification models like this one. To find the most significant variable, univariate analysis using the PDF can be performed.

Univariate analysis using PDF

Rules of thumb to determine the most significant features using the PDF:

  • The higher the separation among the PDF curves of the target classes, the better the classification
  • Heavily overlapping PDF curves lead to the worst classification

Now let’s perform univariate analysis on a subset of features using PDF.

b. Analyzing wcc

Since the PDF curves of wcc for the two gender categories overlap heavily, it is very hard to come up with a condition for a simple classifier.

c. Analyzing bmi

Since the PDF curves of bmi for the two gender categories overlap, a simple classification model would have a high level of misclassification.

d. Analyzing pcBfat

A simple model (as shown below) can be built using pcBfat, but again it leads to some misclassification.

if pcBfat <= 11.80:
    sex = "male"
else:
    sex = "female"

e. Analyzing lbm

A simple model (as shown below) can be built using lbm, but again it misclassifies the male points below 64.10 as female and the female points above 64.10 as male.

if lbm <= 64.10:
    sex = "female"
else:
    sex = "male"

Univariate analysis using CDF

All the above simple models lead to some misclassification, but the PDF alone cannot tell us how much. The magnitude of the misclassification error can be obtained from the CDF curves.

a. Analyzing rcc

From the CDF curves we can observe that below 4.68 (the threshold of the model built above) 12.74% of the males are misclassified as female; similarly, above 4.68, 19% of the females are misclassified as male.

So the total misclassification error = 12.74% + 19% = 31.74%.
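
This arithmetic can be reproduced with empirical CDFs. The sketch below uses simulated stand-in samples (the means, standard deviations, and sample sizes are assumptions, not the actual athlete data), so its numbers will differ from the 12.74% and 19% quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the rcc values of each group (not the real data)
female_rcc = rng.normal(4.4, 0.4, 1000)
male_rcc = rng.normal(5.0, 0.4, 1000)

threshold = 4.68
# The rule "rcc <= 4.68 -> female" misclassifies males below the threshold
# and females above it; each rate is just an empirical CDF evaluation
male_error = np.mean(male_rcc <= threshold)
female_error = np.mean(female_rcc > threshold)
total_error = male_error + female_error
```

Each error term is the fraction of one group's sample falling on the wrong side of the threshold, i.e. a value read off that group's empirical CDF.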

b. Analyzing pcBfat

From the CDF curves we can observe that below 11.80 (the threshold of the model built above) 18% of the females are misclassified as male; similarly, above 11.80, 14.70% of the males are misclassified as female.

So the total misclassification error = 18% + 14.70% = 32.70%.

c. Analyzing lbm

From the CDF curves, we can observe that below 64.10 (the threshold of the model built above) 12.74% of the males are misclassified as female; similarly, above 64.10, 7% of the females are misclassified as male.

So the total misclassification error = 12.74% + 7% = 19.74%.

Conclusion

Among rcc, pcBfat, and lbm, the best predictor based on misclassification error is lbm, with an error of 19.74%.

Types of probability distributions

Based on the type of random variable (discrete or continuous), there are two families of probability distributions: discrete and continuous. In this blog, we are going to discuss the following probability distributions.

  1. Discrete probability distributions
  • Discrete uniform
  • Binomial distribution

  2. Continuous probability distributions
  • Continuous uniform
  • Normal distribution
  • Log-normal distribution

1. Normal distribution

The normal (Gaussian) distribution has a bell-shaped curve; its pattern is observed in many natural phenomena such as height, weight, marks, etc. It has two parameters: the mean (µ) and the standard deviation (σ).

The PDF of a random variable X that follows a normal distribution is given as

f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))

Properties of Normal Distribution

  • Mean = median = mode
  • Symmetric in nature
  • The total area under the curve = 1
  • As we move away from the mean, the PDF value decreases
  • As the variance increases, the spread of the distribution also increases and the curve becomes wider
  • 68–95–99.7 Empirical rule
    - 68.2% of the data lies within one standard deviation away from the mean
    - 95% of the data lies within two standard deviations away from the mean
    - 99.7% of the data lies within three standard deviations away from the mean
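
The empirical rule follows directly from the normal CDF; here is a minimal check with scipy (a sketch added for illustration, not part of the original kernel):

```python
from scipy.stats import norm

# Fraction of a normal variable within k standard deviations of the mean:
# P(mu - k*sigma <= X <= mu + k*sigma) = CDF(k) - CDF(-k) in standard units
within = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}
```

`within[1]`, `within[2]`, and `within[3]` come out to roughly 0.683, 0.954, and 0.997, matching the rule.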

If we know in advance that a variable follows normal distribution then we can easily tell many properties of the variable without looking at the actual data.

Normality check of height, weight column through visualization and skewness

From the curves, we can easily observe that height and weight are almost normally distributed, but there is a small amount of asymmetry, which can be measured through skewness.

Skewness: a statistical parameter that measures the asymmetry of a random variable’s distribution about its mean. Its value can be positive, negative, or undefined. A negative value indicates the data is left-skewed, whereas a positive value indicates the data is right-skewed.

Here, height is slightly skewed to the left, whereas weight is slightly skewed to the right.

Now let’s try to answer the below questions.

  • What % of athletes have a height <= 165 cm?
  • What % of athletes have a height between 165 and 185 cm?
  • What % of athletes have a height > 185 cm?

For any random variable with a finite mean and standard deviation, such questions can at least be bounded using Chebyshev’s inequality.

Assume for a moment that height strictly follows a normal distribution; its distribution would then look like the one below.

If the height feature perfectly followed a normal distribution, the above questions could easily be answered using the CDF of the normal distribution.
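
Under that normality assumption, the three questions reduce to CDF evaluations. The mean and standard deviation below are assumed placeholder values for illustration, not statistics quoted in the article:

```python
from scipy.stats import norm

mu, sigma = 180.1, 9.71  # hypothetical height statistics in cm

p_below_165 = norm.cdf(165, mu, sigma)
p_165_to_185 = norm.cdf(185, mu, sigma) - norm.cdf(165, mu, sigma)
p_above_185 = norm.sf(185, mu, sigma)  # sf(x) = 1 - cdf(x)
```

The three probabilities partition the real line, so they sum to 1.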

Normality test

Normality tests are used to determine whether data is normally distributed, i.e. whether the sample comes from a normally distributed population.
There are various graphical and numeric tests to determine this.
1. Graphical tests

  • Histogram/density plot
  • Q-Q plot

2. Numeric tests

  • Shapiro-Wilk test
  • Kolmogorov–Smirnov test

a. QQ Plot

It is a graphical method for comparing two probability distributions by plotting their quantiles against each other. For a normality test, one distribution comes from the given sample and the other is the standard normal distribution.
There are built-in methods in the statsmodels and scipy packages to draw a Q-Q plot. We can also plot it manually.

Steps for plotting Q-Q plot manually

Assume X is a random variable representing the given sample and Y ~ N(0, 1) is a random variable following the standard normal distribution, with mean (µ) equal to 0 and standard deviation (σ) equal to 1.

  • Compute all the percentiles of X, say x’₁, x’₂, …, x’₁₀₀. These percentiles are also known as sample quantiles.
  • Compute all the percentiles of Y, say y’₁, y’₂, …, y’₁₀₀. These percentiles are also known as theoretical quantiles.
  • Plot each percentile of X against the same percentile of Y, i.e. form the 2-D points (x’₁, y’₁), (x’₂, y’₂), …, (x’₁₀₀, y’₁₀₀).
  • If all the points lie close to a straight line (the reference line y = x), then X follows a normal distribution.
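
The steps above can be sketched with numpy and scipy. The sample here is simulated, since the real height column lives in the linked Kaggle kernel:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(180, 10, 202)  # simulated stand-in for the height sample

percentiles = np.arange(1, 100)              # 1st ... 99th
sample_q = np.percentile(x, percentiles)     # sample quantiles of X
theoretical_q = norm.ppf(percentiles / 100)  # standard-normal quantiles of Y

# For a normal sample the points (theoretical_q, sample_q) fall close to a
# straight line, so the correlation between the two quantile series is near 1
r = np.corrcoef(theoretical_q, sample_q)[0, 1]
```

Plotting `sample_q` against `theoretical_q` gives the Q-Q plot; the near-unit correlation is the numeric counterpart of "the points hug the reference line".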

From the plot, can we assume the height feature follows a normal distribution?

Here, most of the points fall close to the reference line, so our assumption seems fairly safe.

b. Shapiro-Wilk test

It’s a numeric test to check whether a sample is normally distributed. It is a hypothesis-based test where the null and alternate hypotheses are defined as below -

H₀(Null Hypothesis): Sample is normally distributed

H₁(Alternate Hypothesis): Sample is not normally distributed

Thus, if the p-value obtained for the W statistic is less than the significance level (α), the null hypothesis is rejected; on the other hand, if the p-value is greater than α, we fail to reject the null hypothesis.

Here, for α = 0.05, the obtained p-value (0.2120) > α, so we fail to reject the null hypothesis, i.e. height came from a normally distributed population.
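
In scipy the test is `scipy.stats.shapiro`. A minimal sketch on a simulated sample follows; the 0.2120 p-value above came from the real height column in the Kaggle kernel, not from this code:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(2)
heights = rng.normal(180, 10, 202)  # simulated, roughly normal sample

stat, p_value = shapiro(heights)  # stat is the Shapiro-Wilk W statistic
reject_h0 = p_value < 0.05        # True would mean "not normal" at alpha = 0.05
```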

c. Kolmogorov–Smirnov(K-S) test

K-S test provides a way to —

  • check whether a sample is drawn from a reference probability distribution (one-sample K–S test)
  • check whether two samples are drawn from the same distribution (two-sample K–S test)

It is a hypothesis-based test where the null and alternate hypotheses for the one-sample K–S test are defined as below -

H₀(Null Hypothesis): Sample follows the reference distribution

H₁(Alternate Hypothesis): Sample does not follow the reference distribution

Here, for α = 0.05, the obtained p-value (0.7958) > α, so we fail to reject the null hypothesis, i.e. height follows a normal distribution.
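
The one-sample test is available as `scipy.stats.kstest`. One caveat: plugging in the sample's own mean and standard deviation, as below, makes the test conservative. This is a sketch on simulated data, not the article's exact computation:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(3)
heights = rng.normal(180, 10, 202)  # simulated stand-in sample

# One-sample K-S test against a normal reference distribution whose
# parameters are estimated from the sample itself
stat, p_value = kstest(heights, "norm", args=(heights.mean(), heights.std()))
```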

Chebyshev’s inequality

From the 68–95–99.7 empirical rule for a normally distributed dataset, we know what % of the data lies within k standard deviations of the mean. But what if the data does not follow a normal distribution? How do we know what fraction of the data lies within k standard deviations of the mean for an arbitrary distribution?

To answer such questions about the dispersion of an arbitrary distribution, Chebyshev’s inequality is used.

Chebyshev’s inequality states that no more than a 1/k² fraction of the data falls more than k standard deviations away from the mean:

P(|X − µ| ≥ kσ) ≤ 1/k²

In other words, at least a 1 − (1/k²) fraction of the data falls within k standard deviations of the mean, for any sample with a finite mean and a finite standard deviation.

Let’s explore the inequality more with a few values of k.

  • For k = 2, 1- (1/k²) = 0.75 i.e. at least 75% of the data falls within two standard deviations of the mean for any random distribution.
  • For k = 3, 1- (1/k²) = 0.89 i.e. at least 89% of the data falls within three standard deviations of the mean for any random distribution.
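
The bound holds for any distribution, normal or not. A quick numerical check on a deliberately skewed (exponential) sample, a sketch unrelated to the athlete data:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=10_000)  # heavily right-skewed sample
mu, sigma = x.mean(), x.std()

# Observed fraction within k standard deviations; Chebyshev guarantees
# this is at least 1 - 1/k**2 for any data set
fractions = {k: np.mean(np.abs(x - mu) <= k * sigma) for k in (2, 3)}
```

For this skewed sample the observed fractions exceed the 75% and 89% floors, as the inequality promises.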

Let’s try to answer the following questions using Chebyshev inequality.

  • What % of athletes have a height between 160.68 and 199.52 cm?
  • What % of athletes have a height between 150.97 and 209.23 cm?
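
The two quoted ranges correspond to µ ± 2σ and µ ± 3σ if we back out µ ≈ 180.1 and σ ≈ 9.71 (values inferred from the ranges themselves, not quoted from the data set). Chebyshev then gives the guaranteed fractions:

```python
# mu and sigma inferred from the quoted intervals (assumed, not from the data)
mu, sigma = 180.1, 9.71

bounds = {}
for lo, hi in [(160.68, 199.52), (150.97, 209.23)]:
    k = round((hi - mu) / sigma, 2)  # 2.0 for the first range, 3.0 for the second
    bounds[k] = 1 - 1 / k**2         # at least this fraction lies in [lo, hi]
```

So at least 75% of athletes fall in the first range and at least about 89% in the second, regardless of the shape of the height distribution.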

2. Log-Normal distribution

A random variable X is said to be log-normally distributed if the natural logarithm of X is normally distributed. In other words, X ~LogNormal(µ,σ) if log(X) is normally distributed.
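
This defining property is easy to check numerically: take the log of a log-normal sample and it should look normal (a sketch with simulated data, not the ferr column):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(5)
x = rng.lognormal(mean=3.0, sigma=0.5, size=500)  # simulated log-normal sample

log_x = np.log(x)               # should look normal with mean ~3.0, std ~0.5
stat, p_value = shapiro(log_x)  # Shapiro-Wilk on the log-transformed data
```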

The PDF of a log-normally distributed random variable is given as

f(x) = (1 / (xσ√(2π))) · e^(−(ln x − µ)² / (2σ²)), for x > 0

Let’s plot the distribution of ferr feature.

We can observe that the ferr feature follows some sort of log-normal distribution, with a longer tail on the right than on the left.

Comparing the distribution of ferr against log-normal distribution using QQ plot

QQ plot can be used to compare two probability distributions by plotting their quantiles against each other.

From the above QQ plot, we can observe that most of the points do not lie close to the reference line, so the ferr feature does not strictly follow a log-normal distribution.

3. Binomial distribution

The binomial distribution is a discrete probability distribution for obtaining exactly k successes out of n Bernoulli trials.

Characteristics of Bernoulli trials -

  • Each trial has only two possible outcomes: success and failure.
  • The total number of trials is fixed.
  • The probabilities of success and failure remain the same across all the trials.
  • The trials are independent of each other.

The binomial distribution is a way of calculating the probability of k successes in n Bernoulli trials.

The PMF of a binomial random variable is given as

P(X = k) = C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ

where p = probability of success and (1-p) = probability of failure

k = number of successes and (n-k) = number of failures

From the underlying dataset, we can observe that only 12.37% (25/202) of the athletes play basketball. Now, if we choose a random sample of 50 athletes, then -

  • What is the probability that exactly two athletes play basketball?
  • What is the probability that at most 10 athletes play basketball?
  • What is the probability that at least 20 athletes play basketball?

Since all the above questions ask for a given number of successes (2, 10, 20) out of a fixed number of trials (50) with p = 0.1237, the binomial distribution can be used to answer them.
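
With `scipy.stats.binom` these become one-liners (n = 50 and p = 0.1237 taken from the questions above):

```python
from scipy.stats import binom

n, p = 50, 0.1237  # 50 athletes sampled; success = "plays basketball"

p_exactly_2 = binom.pmf(2, n, p)    # P(X = 2)
p_at_most_10 = binom.cdf(10, n, p)  # P(X <= 10)
p_at_least_20 = binom.sf(19, n, p)  # P(X >= 20) = 1 - P(X <= 19)
```

With a mean of n·p ≈ 6.2 successes, "at most 10" is very likely while "at least 20" is vanishingly rare.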

4. Uniform distributions

4.a Discrete uniform distribution

A discrete uniform distribution is a symmetric distribution with the following properties.

  • It has a fixed number of outcomes.
  • All the outcomes are equally likely to occur.

If a random variable X follows a discrete uniform distribution over k discrete values, say x₁, x₂, …, x_k, then the PMF of X is given as

P(X = xᵢ) = 1/k, for i = 1, 2, …, k

From the given dataset, we can observe that the sex feature has two possible values: male and female. There are almost equal numbers of male (100) and female (102) athletes, so if we assume that the sex feature strictly follows a uniform distribution, then

So there is a 50% chance that a random athlete is male, and similarly a 50% chance that a random athlete is female.
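
scipy models this as `scipy.stats.randint`, a uniform distribution over the integers [low, high − 1]. A sketch with the k = 2 outcomes coded as 0 (male) and 1 (female), an encoding chosen here for illustration:

```python
from scipy.stats import randint

k = 2  # two equally likely outcomes
pmf = [randint.pmf(i, 0, k) for i in range(k)]  # each outcome has probability 1/k
```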

4.b Continuous uniform distribution

If a continuous uniformly distributed random variable X is defined on the interval [a, b], then the PDF of X is given as

f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise

Let’s consider a subset of the data with wcc values between 4.40 and 5.40 for one set and between 7.70 and 9.90 for another. The distribution of this subset is shown below, where the probability is the same across all three bins of the continuous range.
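
For the first range, `scipy.stats.uniform` (parameterized by `loc = a` and `scale = b − a`) gives a flat density of 1/(b − a); a minimal sketch:

```python
from scipy.stats import uniform

a, b = 4.40, 5.40                # the first wcc range from the text
u = uniform(loc=a, scale=b - a)  # continuous uniform on [a, b]

density = u.pdf(4.9)                # constant 1/(b - a) = 1.0 inside [a, b]
p_half = u.cdf(4.90) - u.cdf(4.40)  # half the interval carries half the mass
```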

End Notes

In this journey so far, we discussed different kinds of probability distributions, giving special importance to the need for probability distributions in a DS/ML context.


AI practitioner and technical consultant with five years of work experience in the field of data science, machine learning, big data, and programming.