
The Ultimate Guide to Hypothesis Testing and Confidence Intervals in Different Scenarios

A step-by-step tutorial on one-sample and two-sample statistical inference for means and proportions

Photo by Markus Winkler on Unsplash

Getting Started

Statistical inference is the process of making reasonable guesses about a population’s distribution and parameters given the observed data. Conducting hypothesis testing and constructing confidence intervals are two examples of statistical inference. Hypothesis testing is the process of calculating the probability of observing the sample statistic given that the null hypothesis is true. By comparing that probability (the p-value) with the significance level ɑ, we make reasonable guesses about the population from which the sample is drawn. With a similar process, we can calculate a confidence interval at a certain confidence level. A confidence interval is an interval estimate for a population parameter: the point estimate plus or minus the critical value times the standard error. This article will discuss the standard procedure for conducting hypothesis tests and estimating confidence intervals in the following scenarios:

different scenarios

This article serves both as a tutorial for statistical inference and as a cheat sheet for your reference. The sections below discuss the procedures in detail, and at the end of the article I summarize the discussion in two tables for convenience.


1, Statistical Inference for Mean

1.1 Distribution Assumptions

We need assumptions about the underlying distributions when using statistical inference techniques. According to the central limit theorem, the distribution of sample means approaches a normal distribution as the sample size increases, no matter what the population distribution is. The sample mean thus follows a normal distribution if the sample size is large enough.

The test we usually use here is either the Student’s t-test or the z-test. The z-test is based on the normal distribution, while the t-test is based on a distribution that is similar to the normal distribution but has fatter tails. When the sample size is smaller than 30 (the standard cut-off) or the population standard deviation is unknown, we use the t-test; otherwise, we use the z-test.

1.2 One-Sample Mean

For a sample with n observations:

We observe Ᾱ as the mean of the sample. We can test whether this sample is drawn from a population with mean equal to μ by checking whether Ᾱ differs significantly from μ. We can also estimate a 95% confidence interval for the mean of the population from which this sample is drawn.

Hypothesis Testing

Here are the steps for conducting hypothesis testing:

  • Step 1: Set up the null hypothesis:

Two tails:

H0: Ᾱ = μ

H1: Ᾱ != μ

One tail:

H0: Ᾱ ≥ μ

H1: Ᾱ < μ

or:

H0: Ᾱ ≤ μ

H1: Ᾱ > μ

The alternative hypothesis H1 is the hypothesis we want to test. For example, if we want to test whether Ᾱ is larger than μ, we set H1 as Ᾱ > μ.

  • Step 2: Calculate the test statistic:

For the Student’s t-test, we need to use the sample standard deviation s to estimate the population standard deviation σ:

sample standard deviation s

and the t statistic is:

t-test statistic

Keep in mind that for the Student’s t-test, since the sample size is relatively small, we need to specify the degrees of freedom to find the right critical value. The degrees of freedom are defined as n-1, where n is the sample size.

If we know the population standard deviation σ and the sample size n is greater than 30, we can use the z-test and calculate the z statistic:

z test statistic
  • Step 3: Compare the critical value to test statistic

To get the critical value, we need to specify the significance level ɑ and refer to the t or z tables. For example, for a two-tailed t-test with a sample size of 10 (nine degrees of freedom) at the 95% confidence level (5% significance level), the critical value is 2.262, as highlighted below:

t table from https://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf

For a two-tailed z-test at the 95% confidence level (5% significance level), the critical value is 1.96, as highlighted below:

Z table from https://www.math.arizona.edu/~rsims/ma464/standardnormaltable.pdf

The graph below shows the meaning of the critical value. The z-test is based on the standard normal distribution N(0,1). The nature of this distribution indicates that for a random variable x that follows N(0,1), there is only a 5% chance that |x| ≥ 1.96. The critical value of 1.96 is therefore associated with a 95% confidence level (5% significance level). If the z test statistic calculated above is larger than the critical value 1.96, the probability of observing this sample statistic under the null hypothesis (the p-value) is less than 5%. Thus we can reject the null hypothesis at the 5% significance level.

critical value and significance level for a two-tailed test

Note that at the same confidence level, 95%, the critical value for the t-test is larger than that for the z-test, which corresponds to the fact that the t distribution has fatter tails.
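
Instead of reading values off printed tables, we can also compute the test statistic, p-value, and critical values directly in Python. Below is a minimal sketch using scipy, assuming a small made-up sample (data) and a hypothesized population mean mu0 = 5; the numbers are purely illustrative:

import numpy as np
from scipy import stats

data = np.array([4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 4.7, 5.2, 5.4, 5.1])  # hypothetical sample, n = 10
mu0 = 5.0  # hypothesized population mean

# t statistic and two-tailed p-value for the one-sample t-test
t_stat, p_value = stats.ttest_1samp(data, popmean=mu0)

# critical values at the 95% confidence level (5% significance)
t_crit = stats.t.ppf(0.975, df=len(data) - 1)  # about 2.262 for 9 degrees of freedom
z_crit = stats.norm.ppf(0.975)                 # about 1.96

# reject H0 if |t_stat| > t_crit, or equivalently if p_value < 0.05
print(t_stat, p_value, t_crit, z_crit)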

Confidence Interval

The confidence interval is an interval estimate, with a certain confidence level, for a parameter. It is calculated as the point estimate plus or minus the margin of error (ME):

confidence interval

The point estimate is just the mean of the sample, and ME is calculated by:

margin of error

The distribution and the confidence level determine the critical value, and the standard error (SE) is calculated from the sample or population standard deviation. For a one-sample mean’s confidence interval, if we do not know the population variance or the sample size is small, we calculate it by:

one sample mean confidence interval

where Ᾱ is the sample’s mean, and t can be found in the t table above based on the confidence level and the degrees of freedom. For example, for a sample with 10 observations, the t value for the 95% confidence interval is 2.262.

Otherwise, we need to use the z table to calculate the confidence interval:

one sample mean confidence interval

The z value can be found in the z table. The z value for the 95% confidence interval is 1.96.
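
As a quick illustration, here is a minimal sketch of both versions of the interval in Python, reusing the hypothetical data array from the snippet above; the z-based interval simply reuses the sample standard error for illustration:

import numpy as np
from scipy import stats

data = np.array([4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 4.7, 5.2, 5.4, 5.1])  # hypothetical sample
n = len(data)
mean = data.mean()
se = data.std(ddof=1) / np.sqrt(n)  # sample standard error

# t-based interval (population variance unknown or small sample)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_t = (mean - t_crit * se, mean + t_crit * se)

# z-based interval (known population sigma or large sample)
z_crit = stats.norm.ppf(0.975)
ci_z = (mean - z_crit * se, mean + z_crit * se)

print(ci_t, ci_z)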

1.3 Two Samples Mean: Independent Samples

When we observe two samples, we may wonder whether the means of the two samples differ significantly from each other. If we have reason to believe that the two samples are uncorrelated with each other, we can either conduct a hypothesis test with a null hypothesis stating that the means are equal, or construct a confidence interval for the difference of the means and check whether zero is inside the interval. The procedure is quite similar to the one-sample case, with a bit of difference in calculating the test statistic and standard error.

Hypothesis Testing

For two samples with means Ᾱ1 and Ᾱ2, we can set up the null hypothesis and alternative hypothesis for a two-tailed test like this:

H0: Ᾱ1 = Ᾱ2

H1: Ᾱ1 != Ᾱ2

We can also set up a one-tailed test (e.g., H1: Ᾱ1 > Ᾱ2) if we want to check whether one of the means is significantly larger than the other. If the samples are not large enough, we can use the t table, assuming a t distribution, and calculate the t statistic as follows:

t statistic for two independent samples’ means

where S1² and S2² are the variances of the two samples, calculated by:

sample 1’s standard deviation
sample 2’s standard deviation

Depending on the practical situation, we can also set up the null hypothesis to check whether the difference between the two means is greater than a certain number larger than 0, which is sometimes referred to as the effect size. A larger effect size makes it easier to reject the null hypothesis since the difference is bigger, thus increasing the statistical power. For more details, you can check out my article here:

How is Sample Size Related to Standard Error, Power, Confidence Level, and Effect Size?

Note that when we calculate the combined standard error as above, we assume that the two samples come from populations that have different variances (σ1² != σ2²). When we believe σ1² = σ2², we can calculate the pooled standard deviation:

pooled standard deviation

and calculate the standard error for the test statistic as follows:

standard error

The test statistic for the null hypothesis becomes:

test statistic for two-sample means
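
For reference, here is a minimal sketch of the two-sample test in Python with two hypothetical independent samples; scipy’s ttest_ind covers both cases, with equal_var=False giving the unequal-variance (Welch) version and equal_var=True the pooled version:

import numpy as np
from scipy import stats

group1 = np.array([5.1, 4.9, 5.6, 5.3, 5.0, 5.4, 4.8, 5.2])  # hypothetical sample 1
group2 = np.array([4.6, 4.9, 4.7, 5.0, 4.5, 4.8, 4.9, 4.6])  # hypothetical sample 2

# unequal population variances assumed (Welch's t-test)
t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

# equal population variances assumed (pooled standard error)
t_pooled, p_pooled = stats.ttest_ind(group1, group2, equal_var=True)

print(t_welch, p_welch, t_pooled, p_pooled)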

Confidence Interval

The confidence interval for two sample means is used to describe the difference between the two means. Using the t critical value, we can calculate the confidence interval as follows:

confidence interval for two sample means

Note that, similar to the discussion above, with different assumptions about the population variances, we calculate the standard error in the margin of error term differently.
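
As a sketch of the unequal-variance case, the interval for the difference of the two means can be computed as below, reusing the hypothetical group1 and group2 arrays from the previous snippet; the Welch–Satterthwaite approximation for the degrees of freedom is one common choice and is an assumption here, not something derived in this article:

import numpy as np
from scipy import stats

group1 = np.array([5.1, 4.9, 5.6, 5.3, 5.0, 5.4, 4.8, 5.2])
group2 = np.array([4.6, 4.9, 4.7, 5.0, 4.5, 4.8, 4.9, 4.6])
n1, n2 = len(group1), len(group2)

diff = group1.mean() - group2.mean()
v1, v2 = group1.var(ddof=1) / n1, group2.var(ddof=1) / n2
se = np.sqrt(v1 + v2)  # unpooled standard error of the difference

# Welch–Satterthwaite degrees of freedom (assumed approximation)
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
t_crit = stats.t.ppf(0.975, df=df)

ci = (diff - t_crit * se, diff + t_crit * se)
print(ci)  # if 0 falls outside the interval, the means differ at the 5% level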

1.4 Two Samples Mean: Paired Samples

In the previous section, we discussed the situation where the two samples are independent of each other. What about the situation where the two samples are correlated with each other in some way? For example, the two samples come from the same subjects before and after a treatment, or the samples were taken from different people in the same household, etc. We usually have n1 = n2 in these cases. For example, if we want to test whether there is a treatment effect in the treated group, we can collect samples before and after the treatment:

sample before treatment
sample after treatment

We need to calculate the difference before and after treatment for each individual to get the sample of observed differences:

sample of difference

where each observed difference is the after-treatment value minus the before-treatment value for the same individual.

In this way, we have transformed a two-sample case into a one-sample case. Following the procedure discussed above, we first calculate the mean and standard deviation of the sample of differences:

mean of the sample of difference
standard deviation of the sample of difference

Hypothesis Testing

We can set up the null hypothesis based on the practical situation. Typical null and alternative hypotheses for a two-tailed test are:

H0: Ᾱ^d = μ

H1: Ᾱ^d != μ

μ can be any number. The test statistic is calculated as follows:

statistic for the sample of difference

Depending on the sample, we can choose to conduct a t-test or a z-test.
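
Here is a minimal sketch of the paired test in Python, assuming hypothetical before/after measurements on the same subjects; it also shows that the paired t-test is equivalent to a one-sample t-test on the differences:

import numpy as np
from scipy import stats

before = np.array([82, 75, 90, 68, 77, 85, 80, 73])  # hypothetical pre-treatment values
after = np.array([85, 74, 93, 72, 80, 88, 84, 75])   # hypothetical post-treatment values

# paired t-test
t_stat, p_value = stats.ttest_rel(after, before)

# equivalent one-sample t-test on the differences (mu = 0)
d = after - before
t_same, p_same = stats.ttest_1samp(d, popmean=0)

print(t_stat, p_value, t_same, p_same)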

Confidence Interval

We can also construct a confidence interval for the sample of differences. We only need the mean and standard deviation of the differences to construct the interval. A confidence interval based on the Student’s t distribution is:

confidence interval for the sample of difference
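
A minimal sketch of this interval, reusing the hypothetical before/after arrays from the previous snippet:

import numpy as np
from scipy import stats

before = np.array([82, 75, 90, 68, 77, 85, 80, 73])
after = np.array([85, 74, 93, 72, 80, 88, 84, 75])

d = after - before                   # sample of differences
n = len(d)
se = d.std(ddof=1) / np.sqrt(n)      # standard error of the mean difference
t_crit = stats.t.ppf(0.975, df=n - 1)

ci = (d.mean() - t_crit * se, d.mean() + t_crit * se)
print(ci)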

2, Statistical Inference for Proportion

2.1 Distribution Assumptions

The mean measures the central tendency of continuous variables, but it cannot be used for categorical variables. For categorical variables, we use the proportion of each category’s count in statistical analysis. The proportion of category i in a sample with n categories is calculated by:

calculate proportion

C_i is the number of observations in category i, and N is the sample size (total observations of all n categories).

Here I will use a simple example to illustrate the process. When tossing a coin, we can get either "Head" or "Tail". Rather than the normal distribution used for inference about means, we use the binomial distribution for binary proportions. According to the properties of the binomial distribution, as the sample size gets larger, the binomial distribution approaches a normal distribution. The standard definition of "a large sample" in statistical inference is when np and n(1-p) are both larger than 10. If not, we use the Student’s t distribution for the inference.

2.2 One-Sample Proportion

A one-sample proportion measures the proportion of a category in a sample. As discussed above, a use case of the one-sample proportion is to test whether a coin is unbiased. With a large enough number of tosses, the proportion of "Head" should equal the proportion of "Tail" at 0.5 if the coin is unbiased (Law of Large Numbers).

Hypothesis Testing

Hypothesis testing for a one-sample proportion follows a similar setup procedure. Using the coin-tossing example above, if we want to test whether a coin is unbiased, given a sample of coin-tossing results:

It is the same as testing:

H0: P_H = P_0

H1: P_H != P_0

To test whether the coin is unbiased, we can set P_0 = 0.5. Note that this is a two-tailed test. We can set the alternative hypothesis to P_H > 0.5 to test whether this coin is biased towards "Head".

We first need to count how many "Head" results are in the sample to calculate P_H. After that, supposing the sample size is large enough, we can calculate the z statistic:

z statistic for one-sample proportion

P_H is calculated from the sample. P_0 is set at 0.5 in this example. The denominator is the standard error for this sample (derived from the binomial distribution). Following the same procedure described above, we can use the z table or the t table to find the critical value. By comparing the statistic calculated here with the critical value, we can decide whether or not to reject the null hypothesis.
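
A minimal sketch of this calculation in Python, assuming a hypothetical sample of 200 tosses with 120 heads; the counts are made up for illustration:

import numpy as np
from scipy import stats

n, heads = 200, 120
p_hat, p0 = heads / n, 0.5

# standard error under the null hypothesis uses P_0
se = np.sqrt(p0 * (1 - p0) / n)
z_stat = (p_hat - p0) / se

# two-tailed p-value from the standard normal distribution
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
print(z_stat, p_value)  # reject H0 at the 5% level if |z_stat| > 1.96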

Confidence Interval

The confidence interval for a proportion follows the same pattern as the statistical inference for a mean, using the point estimate and margin of error, except that the standard error is calculated differently here:

confidence interval for one-sample proportion

Note that the standard error for hypothesis testing is different from that for the confidence interval: the former uses P_0 while the latter uses P_H.
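
A minimal sketch of the interval, reusing the hypothetical coin-tossing counts from the previous snippet; note that the standard error now uses P_H instead of P_0:

import numpy as np
from scipy import stats

n, heads = 200, 120
p_hat = heads / n

se = np.sqrt(p_hat * (1 - p_hat) / n)  # standard error based on the observed proportion
z_crit = stats.norm.ppf(0.975)

ci = (p_hat - z_crit * se, p_hat + z_crit * se)
print(ci)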

2.3 Two Samples Proportion

The two-sample proportion test compares the proportions in two samples and is widely used in A/B testing. For example, when we compare the conversion rates of the treatment and control groups to see whether there is a significant treatment effect, we need to test whether the difference in conversion rates is significant. We can use hypothesis testing to test whether the two proportions differ, or construct a confidence interval for the difference.

Hypothesis Testing

Based on the two samples we have, we can calculate the two proportions P1 and P2. To test whether the two proportions are not significantly different from each other, meaning that the two samples could be drawn from the same population, the null hypothesis and alternative hypothesis are:

H0: P1 = P2

H1: P1 != P2

Note that this is a two-tailed test. For a one-tailed test, we can set the alternative hypothesis to check whether P1 is greater than or less than P2.

An important quantity we need to calculate for the two-sample proportion test is called P_pool:

pooled proportion

You can understand it as pooling the two samples together and asking what the proportion of category i is in the pooled sample.

Similarly, if n1P1, n1(1-P1), n2P2, and n2(1-P2) are all greater than 10, we can use the z statistic, as the distribution approximately follows a normal distribution. If not, we need to calculate the t statistic. The statistic is calculated by:

z statistic for two sample proportion

Note that we are using P_pool to calculate the standard error.
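
A minimal sketch of the pooled z-test in Python, assuming hypothetical conversion counts for a control group and a treatment group; the counts are made up for illustration:

import numpy as np
from scipy import stats

x1, n1 = 120, 1000  # hypothetical conversions and sample size, control
x2, n2 = 150, 1000  # hypothetical conversions and sample size, treatment
p1, p2 = x1 / n1, x2 / n2

# pooled proportion across both samples
p_pool = (x1 + x2) / (n1 + n2)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z_stat = (p1 - p2) / se_pool
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
print(z_stat, p_value)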

Confidence Interval

Like the two-sample mean’s confidence interval, the two-sample proportion’s confidence interval is used to make inferences about the difference between the two proportions. If both sample sizes are large enough, we can use the critical value z from the z table and calculate the confidence interval as:

The only difference between the confidence interval and hypothesis testing is the calculation of the standard error. Instead of using the pooled proportion, the confidence interval uses the standard error of each sample individually.
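
A minimal sketch of this interval, reusing the hypothetical conversion counts from the previous snippet; each sample contributes its own variance term to the standard error:

import numpy as np
from scipy import stats

x1, n1 = 120, 1000
x2, n2 = 150, 1000
p1, p2 = x1 / n1, x2 / n2

# unpooled standard error for the difference of proportions
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_crit = stats.norm.ppf(0.975)

ci = ((p1 - p2) - z_crit * se, (p1 - p2) + z_crit * se)
print(ci)  # if 0 falls outside the interval, the proportions differ at the 5% level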


That’s a lot of information to digest. Here I use two tables to summarize the main takeaways of this article:

  • For one sample mean and proportion:
source: Penn State Stat 200
  • For two samples mean and proportion:
source: Penn State Stat 200

Thank you for reading. Here is the list of all my blog posts. Check them out if you are interested!

My Blog Posts Gallery


