How To Find Probability From Probability Density Plots

Understand the distribution of your data by calculating the actual probability within a range

Before going any further, let’s face it.

If you’ve been in the data science field for quite some time, chances are you have made probability density plots (similar to the ones below) to understand the overall distribution of your data.

(Source)

Well… First of all, what’s a density plot? A great and clear explanation by Will Koehrsen is this:

A density plot is a smoothed, continuous version of a histogram estimated from the data.

The most common form of estimation is known as kernel density estimation (KDE).

In this method, a continuous curve (the kernel) is drawn at every individual data point and all of these curves are then added together to make a single smooth density estimation.

The kernel most often used is a Gaussian (which produces a Gaussian bell curve at each data point)
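To make the description above concrete, here is a minimal sketch of a Gaussian KDE built by hand with NumPy. The function name, the sample values, and the bandwidth are assumptions chosen for illustration. One detail the quote leaves implicit: the bell curves are averaged rather than only summed, so that the final curve encloses a total area of 1.

```python
import numpy as np

def gaussian_kde_by_hand(data, grid, bandwidth):
    """Average one Gaussian bump per data point, evaluated on `grid`."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardized distance from grid point i to data point j
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # Averaging (instead of plain summing) keeps the total area equal to 1
    return bumps.mean(axis=1)

sample = np.array([1.0, 1.5, 2.0, 3.5, 4.0])  # made-up values for illustration
grid = np.linspace(-5.0, 10.0, 3001)
density = gaussian_kde_by_hand(sample, grid, bandwidth=0.5)

# Riemann-sum check: the area under the curve should be very close to 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points, and a larger one gives a smoother curve.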

A probability density plot is simply a density plot with the probability density function on the Y-axis and the values of a variable on the X-axis.

Typically, probability density plots are used to understand the distribution of a continuous variable when we want to know the likelihood (or probability) that the variable falls within a range of values.

But here is the thing.

By showing probability density plots, we’re only able to understand the distribution of data visually without knowing the exact probability for a certain range of values.

In other words, it’s hard to quantify the probability under the curve by just looking at the plot.

However, getting the exact probability under the curve is extremely important (I’ll tell you why in the next section), especially when you’re presenting to business stakeholders.

In this article, I’ll show you the full code I used to calculate probability and explain step by step how you can do it as well.

By the end of this article, I hope you’ll understand the distribution of data better by calculating the actual probability within a range of values and subsequently be able to convince stakeholders with your insights.

You can get the dataset and Jupyter notebook from my GitHub.

Let’s get started!


Why are probability density plots not convincing enough?

Looking back at the density plot above, you may have concluded that Alaska Airlines flights tend to arrive early more often than United Airlines flights.

Now imagine your boss challenges your statement with this question: "How much earlier are Alaska Airlines flights compared to United Airlines, and how likely is this to happen? Do you have any numerical evidence to show that your conclusion is correct?"

You’re stunned, because the conclusion comes only from your observation of the overall distribution of data.

Even worse, you don’t have any numerical evidence (an exact probability) to support your claim.

You’re not well prepared and your credibility as a data scientist instantly falls apart if you’re not able to prove your points.

This is where the importance of calculating probability from probability density plots comes in.

Unfortunately, it’s very hard to calculate probability if you use Seaborn to make density plots using distplot.

After spending quite some time figuring out how to calculate probability, I decided to use KernelDensity from sklearn.
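For readers who haven’t used it before, here is a minimal sketch of fitting KernelDensity and reading density values back out. The data is synthetic and the bandwidth is an arbitrary assumption, not a value taken from the notebook.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
values = rng.normal(loc=3.0, scale=1.0, size=200)  # synthetic stand-in data

# scikit-learn expects a 2-D array of shape (n_samples, n_features)
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(values[:, np.newaxis])

x = np.linspace(0.0, 6.0, 7)[:, np.newaxis]
log_density = kde.score_samples(x)  # log of the estimated density at each x
density = np.exp(log_density)       # back to the density scale (Y-axis values)
print(density.round(3))
```

Note that score_samples returns the log of the density, so it has to be exponentiated before it can be plotted or used to calculate an area.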

The method works like a charm and I’m so excited to share this with you! 👇🏻


Here’s how to find probability from probability density plots

We’ll use the tips dataset, which contains several factors that could affect the amount customers tip in a restaurant. You can get the data here.

Tips data loaded in a dataframe

Since our goal here is to find probability from density plots, we’ll keep it simple and focus on answering one question: do customers tip more at lunch time or at dinner time?

Since the data is already clean enough, we can start plotting the density plots and calculating the respective probabilities directly.

For the following steps, please refer to the notebook for the full code implementation with the functions used.

1. Make probability density plots

Since Seaborn doesn’t provide any functionality to calculate probability from a KDE, the code follows the 3 steps below to make probability density plots and output the KDE objects, which are used to calculate probability afterwards.

  • Plot normalized histograms
  • Perform Kernel Density Estimation (KDE)
  • Plot probability density
Probability density plot of tips amount (USD) given by customers
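The three steps above can be sketched roughly as follows. This is not the notebook’s plot_prob_density function: the helper name plot_density_sketch, the bandwidth, the bin count, and the synthetic lunch and dinner samples are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import KernelDensity

def plot_density_sketch(values, label, ax, bandwidth=0.3, n_points=200):
    """Hypothetical helper: histogram plus KDE curve; returns the fitted KDE."""
    values = np.asarray(values, dtype=float)
    # Step 1: normalized histogram (density=True makes the bar areas sum to 1)
    ax.hist(values, bins=15, density=True, alpha=0.3, label=f"{label} (hist)")
    # Step 2: kernel density estimation
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(values[:, np.newaxis])
    # Step 3: evaluate the density on a grid and draw the curve
    grid = np.linspace(values.min() - 1, values.max() + 1, n_points)
    density = np.exp(kde.score_samples(grid[:, np.newaxis]))
    ax.plot(grid, density, label=f"{label} (KDE)")
    return kde

rng = np.random.default_rng(42)
# Synthetic stand-ins for the lunch and dinner tip columns, not the real data
lunch = rng.gamma(shape=6.0, scale=0.45, size=70)
dinner = rng.gamma(shape=5.0, scale=0.65, size=170)

fig, ax = plt.subplots()
kde_lunch = plot_density_sketch(lunch, "Lunch", ax)
kde_dinner = plot_density_sketch(dinner, "Dinner", ax)
ax.set_xlabel("Tip (USD)")
ax.set_ylabel("Probability density")
ax.legend()
fig.savefig("tips_density_sketch.png")
```

Returning the fitted KDE object from the plotting helper is the key design choice: the same object can later be evaluated over any range of values.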

Now we have the probability density plot of tip amounts at lunch and dinner time for comparison.

Looking only at the range of 1–3 USD, we may conclude that customers are more likely to tip 1–3 USD at lunch time than at dinner time.

Again, in order to have numerical evidence (aka probability values) to reinforce our statement, let’s calculate the probability of customers giving 1–3 USD tips during lunch and dinner time for comparison.

2. Calculate probability

Once we’ve made the probability density plots with the function plot_prob_density, we pass the KDE objects it returns to the next function, get_probability, to calculate probability.

Calculate and output probability

And there you have it!

The total area under a normalized density curve is always equal to 1. Since probability is the area under the curve, we can specify a range of values (1–3 USD tips in this case) and calculate the probability within that range.
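In symbols, writing \hat{f} for the estimated density and X for the tip amount (notation introduced here for illustration, not taken from the notebook), the two statements above read:

```latex
\int_{-\infty}^{\infty} \hat{f}(x)\,dx = 1,
\qquad
P(a \le X \le b) = \int_{a}^{b} \hat{f}(x)\,dx
```

For the 1–3 USD range, a = 1 and b = 3.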

Therefore, the probability can be approximated by multiplying the probability density values (Y-axis) by the width of a small step along the tip amount (X-axis).

The multiplication is done at each evaluation point, and the products are then summed to give the final probability.
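As a rough sketch of this multiply-and-sum idea (not the notebook’s get_probability function), the snippet below approximates the area under any density function with a midpoint Riemann sum. The function names are hypothetical, and a standard normal density is used only because its probability between -1 and 1 is known to be about 0.6827, which makes the result easy to check.

```python
import numpy as np

def probability_in_range(density_fn, start, end, n_slices=1000):
    """Approximate the area under density_fn between start and end."""
    step = (end - start) / n_slices  # width of each slice on the X-axis
    # Evaluation points: the midpoint of every slice
    x = start + step * (np.arange(n_slices) + 0.5)
    # Density (Y-axis) times slice width (X-axis), summed over all slices
    return float(np.sum(density_fn(x) * step))

def standard_normal_pdf(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

p = probability_in_range(standard_normal_pdf, -1.0, 1.0)
print(round(p, 3))
```

With a fitted sklearn KernelDensity object, here called kde as a placeholder name, a matching density function would be `lambda x: np.exp(kde.score_samples(x[:, np.newaxis]))`, because score_samples returns log-density values.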

The calculated probabilities support our original claim: customers are more likely to tip 1–3 USD at lunch time than at dinner time, with a probability of 63% versus 49%.


Final Thoughts

Photo credit: Campaign Creators

Thank you for reading.

I hope this article gives you a better understanding of probability density plots and, most importantly, shows you how to calculate the actual probability within a range of values under a probability density curve.

Calculating probability is fairly simple, but it is far from unimportant. It plays a crucial part in giving stakeholders a better understanding of your probability density plots, so that actionable insights rest on numerical evidence instead of subjective and ambiguous observation.

As always, if you have any questions or comments feel free to leave your feedback below or you can always reach me on LinkedIn. Till then, see you in the next post! 😄


About the Author

Admond Lee is currently the Co-Founder/CTO of Staq, the #1 business banking API platform for Southeast Asia.

Want to get free weekly data science and startup insights?

Join Admond’s email newsletter – Hustle Hub, where every week he shares actionable data science career tips, mistakes & learnings from building his startup – Staq.

You can connect with him on LinkedIn, Medium, Twitter, and Facebook.
