Explain confidence interval to a business user

How to answer the famous data science interview question concisely

Aparna Gopakumar
Towards Data Science

Image from Coursera-stats

When we start our journey as data scientists, one of the topics that many of us get stuck on is the confidence interval. It was my internship experience that gave me a clear understanding of this concept. Even so, when I was recently asked in an interview “How would you explain confidence interval to a business user?”, I still struggled to phrase my answer properly.

This post is for everyone who is struggling to understand the concept of the confidence interval, and also for those who understand the concept but are not able to communicate it to non-technical audiences. My aim here is just to give a vanilla intuition on this topic. Consider this a starting point on which you can build your concepts.

This post has two sections. In the first one, I will try to explain the concept of the confidence interval in plain English. In the second section, I will show the math behind the concept.

Confidence interval explained in English

Imagine you are a data scientist at Amazon. Suppose Amazon has 100 million customers, of whom 50 million are male and 50 million are female.

You want to analyze if the spending habit differs between the two genders. The specific question you need an answer for is: On average do women spend more per transaction than men?

One way to find out is to track the amount spent per transaction by all 50 million female customers, track the amount spent per transaction by all 50 million male customers, calculate the two averages, and compare them. What you did just now is collect the population data and DETERMINE the population parameters. However, you will hardly ever have the time and resources (or sometimes money) to collect the entire population data. Therefore, you rely on classical statistics to solve the problem.

First, let us try to calculate the average spending per transaction of female customers.

Let each transaction amount be represented by Si. Average transaction of all 50 million female customers, i.e., population average is:

Average spending of all 50 million female customers

As stated earlier, since you do not have the time and resources to collect the spending of all 50 million female customers, you take a random, independent sample of, say, 30,000 of them. You then record the spending per transaction of these 30,000 female customers:

So now you have 30,000 transaction amounts.

Now you will calculate the average of these 30,000 transactions.

Average spending of 30,000 female customers in the sample
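
Written out, the two averages look like this. This is a sketch under the assumption that each customer contributes one transaction amount S_i; the symbols N, n and the bar notation are labels chosen here for illustration, with N = 50 million (population) and n = 30,000 (sample).

```latex
\begin{align*}
\mu     &= \frac{1}{N} \sum_{i=1}^{N} S_i, & N &= 50{,}000{,}000 \quad \text{(population average)} \\
\bar{S} &= \frac{1}{n} \sum_{i=1}^{n} S_i, & n &= 30{,}000 \quad \text{(sample average)}
\end{align*}
```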

Using classical statistics to solve the business case efficiently

In classical statistics, you use this sample average to find out the population average. In this case, you use the average spending of the sample of 30,000 female customers to find out the average spending of the entire population of 50 million female customers.

There are two approaches to do this:

  1. You conclude that the average spending of 30,000 female customers, (sample average) is equal to the average spending of all the 50 million female customers (population average). This is called the point estimate where you used the sample data to come up with the best guess of an unknown population parameter.
  2. A better and more convincing way is to use this sample average to find out an interval within which the population average will lie. Using this sample of 30,000 female customers you will calculate the interval within which the average spending of 50 million female customers may lie.

The plausible interval, calculated from the sample data, within which the population parameter is expected to lie is called the confidence interval. The confidence level is mostly decided by the business, with 90%, 95%, and 99% being the most common; the higher the level, the wider the interval.

In simple English, a 95% confidence interval is a range, computed from the sample, for the population parameter (here, the average spending of 50 million female customers). The method used to build it captures the true population value in about 95% of repeated samples. In that sense, we can be 95% confident that the population mean lies in the interval. (You must remember that this 95% interval for the population is calculated using only the information from the sample data.)

An important aspect here is how you sample the data. To obtain a sample that is an unbiased representation of the population:

  1. The sample size should not be too small
  2. Samples should be random and independent.

That is, collecting one data point should not in any way determine your next data point. For example, just because your first data point was a 50-year-old married woman, you shouldn't think “now I will collect a 20-year-old single woman”. If you do this, you are introducing bias, and the second data point becomes dependent on the first. Avoiding such choices ensures that the data points in a sample are random and independent of each other. (Read more about selection bias here)
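
As a minimal sketch of what "random" means in practice, the snippet below lets a random number generator pick the customers, so no human choice links one pick to the next. The customer IDs (0 to 49,999,999) and the seed are hypothetical, chosen only for illustration. It draws without replacement, which for 30,000 out of 50 million is very close to independent draws.

```python
import random

rng = random.Random(2024)  # fixed seed only so the sketch is reproducible

population_size = 50_000_000  # hypothetical customer IDs 0 .. 49,999,999
sample_size = 30_000

# Every customer ID has the same chance of being picked; no ID is picked twice
sample_ids = rng.sample(range(population_size), sample_size)
```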

So, next time when you get this question to explain confidence interval in a non-technical way, you can use this idea to phrase the answer.

Imagine you want to figure out some parameter of interest about the population (customize according to the business). Collecting data on the entire population is time-consuming and expensive. So you collect a random sample of data and, from this sample, use statistical methods to calculate an interval within which the population parameter may lie. This interval is called the confidence interval.

Now back to the business problem

To better appreciate the usefulness of the confidence interval, let me go back to our initial problem: on average, do women spend more money per transaction than men?

Just as you took a sample of 30,000 female customers and calculated the average, you will take a sample of 30,000 male customers and calculate the average. Let us assume that the average spending per transaction for female customers calculated from this sample (Female_avg) is $2350 and the average spending per transaction for male customers from the sample (Male_avg) is $1200. Here Female_avg is greater than Male_avg. But before generalizing the result that women spend more money per transaction, remember that you only have the sample information. The sample average might differ from the population average even if we obtained the sample correctly. This is where you find the confidence intervals of the average spending of male customers and of female customers.

Assume that the corresponding 95% confidence interval you get for the average spending of male users is [1150, 1250] and that of female users is [2340, 2360]. This means we are 95% confident that the average spending of all 50 million male customers lies between $1150 and $1250; an interval built this way misses the true average only about 5% of the time. Similarly, we are 95% confident that the average spending of all female customers lies between $2340 and $2360. Since even the upper end of the male interval ($1250) is far below the lower end of the female interval ($2340), you can conclude that on average women spend more money per transaction than men.

To be specific, you are checking whether the two confidence intervals overlap. If they do not overlap, then we can say that there is a difference.

Suppose instead you get the 95% CI of average spending of male users to be [2330, 2350]. Here you can see that the confidence intervals overlap. Thus, we cannot conclude from this check alone that women spend more money per transaction than men.
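
The overlap check can be sketched in a few lines. The function name `intervals_overlap` is just an illustrative choice; the intervals are the ones used in this example.

```python
def intervals_overlap(a, b):
    """Return True if closed intervals a=(low, high) and b=(low, high) share any value."""
    return a[0] <= b[1] and b[0] <= a[1]

female_ci = (2340, 2360)
male_ci = (1150, 1250)        # first scenario
male_ci_alt = (2330, 2350)    # second scenario

print(intervals_overlap(female_ci, male_ci))      # False: a clear difference
print(intervals_overlap(female_ci, male_ci_alt))  # True: cannot conclude a difference
```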

I hope I was able to give you an intuitive understanding of the concept. Now time for some maths.

The math behind Confidence Interval

Before diving in, one prerequisite you must have is a good understanding of the Central limit theorem.

The dataset we are using here is the Black Friday sales data from Kaggle. It is a sample of the transactions made in a retail store on Black Friday. For the sake of continuity in the explanation, assume this dataset comes from Amazon.

Black Friday sales dataset

Remember that we have only a sample of the transactions, 5 million.

The first thing you want to know is whether there is a difference in spending between the two genders. Find the average spending of female customers during Black Friday from this dataset and then compare it with the average spending of male customers.

Let us first start by calculating the average spending of female customers.

Average spending by female

As explained before, this is just the sample average of female users’ transactions. We have to use this $8734 to come up with an interval within which the population average may fall. To calculate this interval we make use of the Central limit theorem.

In a nutshell, the central limit theorem states that whatever the population distribution is, as long as it has a FINITE mean µ and a FINITE variance σ², the distribution of sample means will approximately follow a Gaussian distribution with mean µ and standard deviation σ/√n, where n is the sample size, provided the sample size is large enough.

Applying Central limit theorem to this dataset

  1. Randomly sample, with replacement, 100 data points (the sample size) from the female transactions in the dataset

2. Record the average of this sample ($8124)

3. Repeat steps 1 and 2 10,000 times. You get 10,000 values of averages.

Calculating the mean of each of the 10,000 samples of size 100.

4. Plot the distribution of all 10,000 sample averages collected above. According to the Central limit theorem, this distribution will be approximately Gaussian.

5. Calculate the mean of these sample means, X.

X is approximately calculated as $8700.
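
Steps 1 to 5 can be sketched with the Python standard library alone. The real Black Friday data is not loaded here; `purchases` is a hypothetical, deliberately skewed stand-in generated at random, so the numbers it produces are illustrative and are not the article's results.

```python
import random
import statistics

def sample_means(data, sample_size=100, n_repeats=10_000, seed=0):
    """Steps 1-3: draw n_repeats samples with replacement and keep each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(data, k=sample_size))
        for _ in range(n_repeats)
    ]

# Hypothetical stand-in for female purchase amounts (skewed, clearly not Gaussian)
_rng = random.Random(42)
purchases = [_rng.expovariate(1 / 8700) for _ in range(50_000)]

means = sample_means(purchases)

grand_mean = statistics.fmean(means)   # step 5: mean of the sample means, X
spread = statistics.stdev(means)       # standard deviation of the sample means
expected_spread = statistics.pstdev(purchases) / 100 ** 0.5  # sigma / sqrt(n)

print(round(grand_mean), round(spread), round(expected_spread))
```

Even though the stand-in data is skewed, a histogram of `means` should look roughly bell-shaped, and `spread` should come out close to `expected_spread`, which is what the theorem predicts.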

The distribution obtained in step 4 is centered on X, which approximates the population mean µ (since the distribution is Gaussian), and has standard deviation σ/√100 (100 is the sample size). Using X, we can then give an estimated range for the population mean.

But how?

One important property of a Gaussian distribution with mean µ and standard deviation σ is that about 95% of the values will lie within the range [µ - 2σ, µ + 2σ].

Normal distribution on Wikipedia

Using this property, we can calculate the 95% range centered around X. Remember that X is nothing but the mean of sample means, and that the standard deviation of the sample means is σ/√n. So this 95% range can be understood as: “about 95% of sample means will lie in [µ - 2σ/√n, µ + 2σ/√n]”.

Calculating 95% confidence interval

Let us assume that we know the population standard deviation σ, and let σ be 500. (Big assumption!)

Sample size=100

Sample statistic mean =X= $8700

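The standard deviation of the sample means is σ/√n = 500/√100 = 50. Following the property of the Gaussian distribution, about 95% of sample means lie within 2 × 50 = 100 of the center, which gives the range [8600, 8800].

The interval [8600, 8800] is the 95% confidence interval for the average spending of all the female transactions.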
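
The same arithmetic as a small sketch. The function name `mean_confidence_interval` and its parameters are illustrative choices, and `z=2.0` is the rule-of-thumb multiplier used in this post (the more precise value for 95% is about 1.96).

```python
import math

def mean_confidence_interval(sample_mean, sigma, n, z=2.0):
    """Interval sample_mean +/- z * sigma / sqrt(n), assuming sigma is known."""
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = mean_confidence_interval(sample_mean=8700, sigma=500, n=100)
print(low, high)  # 8600.0 8800.0
```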

In plain English, we can say that [8600, 8800] is the range of plausible values for the average spending of all female customers. That’s it!

Or we can say that intervals built this way contain the actual population mean about 95% of the time. Put differently, we are 95% confident that the actual population mean lies within this range.

A value outside of this interval is an implausible population mean given our sample. For example, if the average spending of all female customers were really $8200, a sample mean as high as $8700 would be extremely unlikely.

Thus using a sample statistic we are able to give a range of plausible values of an unknown parameter of the population.

One point to note is that all of this works only because of the Central limit theorem, which says that the sampling distribution of the mean is approximately Gaussian. If we did not know this, we could not have used the Gaussian property to calculate the upper and lower ends of the interval.

Now you can guess what a 99% confidence interval tells us. It is an interval built so that it captures the average spending of all female customers in about 99% of repeated samples. Following the same property, it is calculated as [X - 2.58σ/√n, X + 2.58σ/√n]; going out to 3 standard deviations of the sample mean would cover roughly 99.7%. Thus, a 99% confidence interval is wider than a 95% confidence interval.
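
To see where the multipliers come from, here is a small sketch using `statistics.NormalDist` from the Python standard library. The helper name `z_multiplier` is an illustrative choice, and σ = 500 and n = 100 are the assumed values from the example above.

```python
from statistics import NormalDist

def z_multiplier(level):
    """Number of standard deviations that covers the central `level` share of a Gaussian."""
    return NormalDist().inv_cdf(0.5 + level / 2)

for level in (0.90, 0.95, 0.99):
    z = z_multiplier(level)
    margin = z * 500 / 100 ** 0.5  # z * sigma / sqrt(n)
    print(f"{level:.0%}: z = {z:.3f}, interval = [{8700 - margin:.0f}, {8700 + margin:.0f}]")
```

The multipliers come out at roughly 1.645, 1.96 and 2.576, so the interval widens as the confidence level rises.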

For simplicity, I have made the big assumption that we know the population standard deviation. Most of the time we will not have this information. When the population standard deviation is unknown, we estimate it from the sample and use the t-distribution to calculate the confidence interval.

Notes:

  1. In machine learning, because data is collected digitally, it is often easy to get population data. If we have population data, there is no need to calculate a confidence interval. If the cost of obtaining data is high (medical applications) or obtaining such data is hard (a third-party marketing company), classical statistical methods like these come in handy.
  2. One use case for calculating a confidence interval is A/B testing. In A/B testing you split your population (users) randomly into two (or more) groups: Control and Challenger. Control and Challenger are two samples of the population. Using these samples, you need to determine the behavior of this population.
  3. The central limit theorem approach shown above applies only when we need the confidence interval of the mean. To estimate the confidence interval of the standard deviation, the median, the 90th percentile, etc., we use bootstrapping. This is because the CLT, as stated here, covers sums and statistics built from sums (such as the mean), and does not directly cover the standard deviation or the median.
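
As a minimal sketch of the bootstrapping mentioned in note 3, the snippet below builds a percentile bootstrap interval for the median. It shows one common variant and is not necessarily the exact procedure you would use in production. The function name `bootstrap_ci` and the data in `amounts` are hypothetical, chosen only for illustration.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.median, n_boot=2_000, level=0.95, seed=0):
    """Percentile bootstrap: resample the data with replacement, recompute the
    statistic each time, and read the interval off the sorted results."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical skewed purchase amounts standing in for real transactions
_rng = random.Random(7)
amounts = [_rng.expovariate(1 / 8700) for _ in range(500)]

median_low, median_high = bootstrap_ci(amounts)
print(round(median_low), round(statistics.median(amounts)), round(median_high))
```

Passing a different function as `stat` (for example `statistics.stdev`) gives an interval for that statistic instead.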

I would love to know if this post has helped you or if you have any feedback. Let me know through your comments.

Also, check out a simple explanation of another confusing topic.

Leave a clap if this post has helped you!
