
5 Probability Distributions You Should Know as a Data Scientist

Confused by probability as a data scientist?

Must-know probability concepts and distributions

Photo by Ryunosuke Kikuno on Unsplash

Data scientists come across many terms related to probability while solving problems in interviews and reading research papers. Hence, knowing the basics of probability and probability distributions is a must for an aspiring data scientist, or even a seasoned one. This knowledge will help you ace interviews, understand data better, and develop more intuitive solutions. The blog has the following sections.

  • Probability basics
  • Random Variable
  • Probability Distributions and their characteristics
  • Uniform Distribution
  • Binomial Distribution
  • Gaussian Distribution
  • Poisson Distribution
  • Exponential Distribution

Don’t worry. The list looks long, but I have made sure this blog is readable and easy to understand. Without further ado, let’s delve into the concepts.

Probability Basics

Let’s first understand the meaning of experiment, sample space, and event, as they are frequently used in statistics and will also help us understand the formal definition of probability. An experiment is any procedure that can be repeated indefinitely and has a well-defined set of possible outcomes. For example, tossing a coin is an experiment: we can toss a coin repeatedly, and it has a set of 2 outcomes (heads or tails). The set of all possible outcomes of an experiment is called the sample space, which in our example is {head, tail}. Lastly, an event is a subset of the sample space, for example, getting a head on a toss. Now let’s understand probability. Probability is the likelihood that an event occurs, measured as the ratio of favorable outcomes to total outcomes, given that all outcomes are equally likely.

Image By Author

Image By Author

So in the tossing of a coin, the probability of getting a head = 1/2 when we consider the outcome over infinitely many trials.
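As a quick sanity check, here is a minimal simulation sketch using Python's built-in random module (the seed and toss counts are just illustrative), showing that the empirical frequency of heads approaches 1/2 as the number of tosses grows.

```python
import random

random.seed(42)

for n in (100, 10_000, 1_000_000):
    # Simulate n fair coin tosses and count the heads
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(f"{n:>9} tosses -> P(head) ≈ {heads / n:.4f}")
```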

Random Variable

A random variable (RV) is a function that assigns a numerical value to each outcome of an experiment. For example, in the tossing of a coin, we can define a random variable X that takes the value 1 when a head comes up. Let’s see how this becomes a function.

When the outcome is heads, X=1

When the outcome is tails, X=0

So p(X=1) = probability of getting head = 1/2

and p(X=0) = probability of getting tail = 1/2

To understand it better, let’s take one more example of tossing three coins. Let the random variable X be the number of heads that occur on the three coins.

Image by Author

p(X=1) = probability of getting one head = 3/8
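To verify this by brute force, here is a small sketch (assuming Python's itertools) that enumerates all 2³ = 8 equally likely outcomes of three coin tosses and counts how many contain exactly one head.

```python
from itertools import product

# All 8 equally likely outcomes of tossing three fair coins
outcomes = list(product("HT", repeat=3))

# Keep the outcomes where the number of heads X equals 1
one_head = [o for o in outcomes if o.count("H") == 1]

print(len(one_head), "/", len(outcomes))           # 3 / 8
print("p(X=1) =", len(one_head) / len(outcomes))   # 0.375
```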


Probability Distributions and their characteristics

A probability distribution is a function that describes the likelihood of the different possible values of a random variable. A probability distribution may be either discrete or continuous. A discrete distribution is one in which the variable can only take on certain values, while a continuous distribution is one in which the variable can take on any value within a specified range (which may be infinite). A distribution can be visualized graphically, as shown below.

Image by Author

Okay, now I have a feel for what a probability distribution is. But how is it relevant to data science? In data science, we often form judgments about the parameters of a population and the reliability of statistical relationships based on random samples of data. Probability distributions are what let us make these judgments.

Every distribution has a different shape, so we need metrics that summarize the shape of a distribution without actually plotting the data. The metrics that provide this information are the mean, variance, and standard deviation. Let’s understand each of them.

Mean

It is the average of the data points and is denoted by μ. For example, for the discrete set of data {1,2,3,4,5}, the mean (μ) is 3 ((1+2+3+4+5)÷5). It is the number which, when subtracted from every data point, makes the average of the transformed data zero.

Variance

The variance is the average of the squared differences between the data points and the mean. It is denoted by σ². For the above example, the variance (σ²) is 2 (((1−3)²+(2−3)²+(3−3)²+(4−3)²+(5−3)²)÷5).

Standard Deviation

It is the square root of the variance and is denoted by σ. For the above example, the standard deviation (σ) is √2 ≈ 1.41. It measures how spread out the numbers in a dataset are; a small standard deviation means the data points are close to each other.
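A quick sketch with NumPy (using the population formulas, i.e. dividing by n, as in the definitions above) reproduces these numbers for the same data:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

mean = data.mean()   # (1+2+3+4+5) / 5 = 3.0
var = data.var()     # population variance: ((1-3)^2 + ... + (5-3)^2) / 5 = 2.0
std = data.std()     # sqrt(2) ≈ 1.41

print(mean, var, std)
```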

Uniform Distribution

We have understood what a probability distribution is and what its characteristics are. Let’s now understand the uniform distribution. The uniform distribution is the simplest probability distribution and is also known as the rectangular distribution. Every outcome has a constant probability. The most common examples of this type of distribution are tossing a fair coin or rolling a fair die.

For a discrete uniform distribution:

Image By Author

For a continuous uniform distribution:

Image By Author

Image By Author

The uniform distribution is used in the bootstrapping technique for calculating confidence intervals. Also, Monte Carlo simulation starts by generating uniformly distributed pseudo-random numbers.
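As a rough sketch of both uses (assuming NumPy; the sample data and the 95% interval are purely illustrative), bootstrap resampling picks indices uniformly at random, and a Monte Carlo run starts from uniform pseudo-random numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=200)   # illustrative dataset

# Bootstrap: resample indices uniformly at random, with replacement
boot_means = [sample[rng.integers(0, len(sample), len(sample))].mean()
              for _ in range(5_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({low:.2f}, {high:.2f})")

# Monte Carlo simulations start from uniform pseudo-random numbers on [0, 1)
u = rng.uniform(0, 1, size=10)
print(u)
```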

Binomial Distribution

In the binomial distribution, the random variable is defined as the number of successes in n independently repeated trials. Let the probability of success in each trial be p; the formula for the binomial probability distribution is then given as

Image By Author

Image by Author

Example: if you buy a lottery ticket, you either win money or you don’t. Any experiment with two possible outcomes, repeated independently n times, can be modeled by a binomial distribution. In data science, the binomial distribution is useful for analyzing the statistics of binary classification problems.
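For a concrete (hypothetical) case, here is a sketch with SciPy: the probability of getting exactly 3 heads in 10 tosses of a fair coin (n = 10, p = 0.5):

```python
from scipy.stats import binom

n, p = 10, 0.5   # 10 fair coin tosses, success probability 0.5
k = 3            # exactly 3 heads

print(binom.pmf(k, n, p))   # P(X = 3)  ≈ 0.117
print(binom.cdf(k, n, p))   # P(X <= 3) ≈ 0.172
```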

Gaussian/Normal Distribution

It is one of the most famous distributions, and many real-world phenomena, such as errors in measurement, the heights of people, and marks in a test, follow it. The formula for this distribution is as follows:

Image by Author

As discussed above, μ is the mean and σ is the standard deviation.

Image By Author

Note that this distribution has a bell-like shape; the peak of the bell is at the mean, and the standard deviation controls the width of the bell.

The normal distribution becomes the standard normal distribution when the mean equals 0 and the standard deviation equals 1.

This distribution has wide applicability in the life of a data scientist and is a must-know distribution. Many machine learning models, such as least-squares regression, the Gaussian Naive Bayes classifier, and linear and quadratic discriminant analysis, are designed to work on datasets that follow a normal distribution.
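As a small sketch (assuming NumPy; the μ = 170 cm and σ = 10 cm "heights" are made up for illustration), drawing samples from a normal distribution and standardizing them (subtracting μ and dividing by σ) gives data that is approximately standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=100_000)   # illustrative μ=170 cm, σ=10 cm

# Standardization: subtract the mean and divide by the standard deviation
z = (heights - heights.mean()) / heights.std()
print(round(z.mean(), 3), round(z.std(), 3))            # ≈ 0.0 and 1.0
```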

Poisson Distribution

The Poisson distribution is often referred to as the distribution of rare events. If an event occurs at a fixed rate in time, e.g., 5 people entering a stadium each second or 2 mangoes ripening every minute on a farm, then the probability of observing n events in a unit of time can be calculated with the Poisson distribution, using the formula below.

Image by Author

where μ is the event rate per unit time.

Image by Author

Many real-world phenomena like car accidents, traffic flow, genetic mutations, and the number of typing errors on a page follow the Poisson distribution. Many shopkeepers use Poisson distribution to forecast the number of customers that will come to their shop.
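As a hypothetical example of the shopkeeper case (assuming SciPy), if customers arrive at a rate of μ = 5 per hour, the probability of seeing exactly n customers in an hour is:

```python
from scipy.stats import poisson

mu = 5   # average of 5 customers per hour (illustrative rate)

for n in range(0, 11):
    print(n, round(poisson.pmf(n, mu), 4))   # P(exactly n customers in an hour)

print("P(more than 10):", round(poisson.sf(10, mu), 4))
```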

Exponential Distribution

The exponential distribution is closely related to the Poisson distribution. If Poisson events occur at a fixed rate, the time interval between two consecutive events is exponentially distributed. The probability density of observing a time interval t between two consecutive Poisson events is as follows

Image by Author

where τ is the average time interval between two consecutive Poisson events.

Image by Author

The exponential distribution has limited usage in data science. In general, if you want to move from the Poisson process (in which you study the number of events) to the time domain, the exponential distribution is the go-to distribution.
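To see the connection empirically, here is a sketch (assuming NumPy; the rate of 2 events per minute is illustrative) that generates exponentially distributed gaps with mean τ = 1/rate = 0.5 minutes and checks that the resulting event counts per minute look Poisson with mean 2:

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0   # 2 events per minute (Poisson rate)

# Inter-arrival times of a Poisson process are exponential with mean 1/rate
gaps = rng.exponential(scale=1 / rate, size=100_000)
print("mean gap ≈", round(gaps.mean(), 3), "minutes (expected 0.5)")

# Count events falling into each one-minute bin to recover the Poisson side
arrival_times = np.cumsum(gaps)
events_per_minute = np.bincount(arrival_times.astype(int))
print("mean events per minute ≈", round(events_per_minute.mean(), 2), "(expected ≈ 2)")
```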

Conclusion

So we have discussed 5 different probability distributions and seen the use cases of each in the life of a data scientist. I hope you liked this article; I would love to hear your feedback on how to improve the blog’s readability.

Become a [Medium](https://medium.com/@AnveeNaik) member to unlock and read many other stories on Medium. Follow us on Medium to read more such blog posts.

