Hypothesis Testing

Parametric Tests — the t-test

One-stop shop for t-tests — from theory to python implementation

Shubhangi Hora
Towards Data Science

In my previous article we went through the whats, hows and whys of hypothesis testing, with a brief introduction to statistical tests and the role they play in helping us determine statistical significance. In this article and the coming few, we'll take a deeper look at statistical tests: the different types, the tests themselves, and which test should be used for which situation.

As mentioned before, statistical tests are statistical methods that help us reject or not reject our null hypothesis. They’re based on probability distributions and can be one-tailed or two-tailed, depending on the hypotheses that we’ve chosen.

There are other ways in which statistical tests can differ, and one of them is the assumption they make about the probability distribution that the data in question follows.

  • Parametric tests are those statistical tests that assume the data approximately follows a normal distribution, amongst other assumptions (examples include z-test, t-test, ANOVA). Important note — the assumption is that the data of the whole population follows a normal distribution, not the sample data that you’re working with.
  • Nonparametric tests are those statistical tests that don't assume a specific distribution for the data, and hence are also known as distribution-free tests (examples include Chi-square, Mann-Whitney U). Nonparametric tests are based on the ranks held by different data points.

Every parametric test has a nonparametric equivalent, which means for every type of problem that you have there’ll be a test in both categories to help you out.

[Table: common parametric tests and their nonparametric equivalents]

The selection of which set of tests is apt for the problem at hand is not this black and white, though. If your data doesn't follow a normal distribution, nonparametric tests are not necessarily the right pick. The decision depends on other factors such as sample size, the type of data you have, which measure of central tendency best represents the data, and so on. Certain parametric tests can perform well on non-normal data if the sample size is large enough; for example, if your sample size is greater than 20 and your data is not normal, a one-sample t-test can still serve you well. But if the median better represents your data, then you're better off with a nonparametric test.

In this article, we will be looking at parametric tests — particularly the t-test.

Parametric tests are those that assume that the sample data comes from a population that follows a probability distribution — the normal distribution — with a fixed set of parameters.

Common parametric tests are focused on analyzing and comparing the mean or variance of data.

The mean is the most commonly used measure of central tendency to describe data; however, it is also heavily affected by outliers. It is therefore important to analyze your data and determine whether the mean is the best way to represent it. If yes, then parametric tests are the way to go! If not, and the median better represents your data, then nonparametric tests might be the better option.

As mentioned above, parametric tests have a few assumptions that need to be met by the data:

  1. Normality — the sample data come from a population that approximately follows a normal distribution
  2. Homogeneity of variance — the samples being compared come from populations with the same variance
  3. Independence — the sample data consists of independent observations and are sampled randomly
  4. Outliers — the sample data don’t contain any extreme outliers

Degrees of Freedom.

Before we get into the different statistical tests, there is one important concept that should be discussed — degrees of freedom.

The degrees of freedom are essentially the number of independent values that are free to vary in a set of data while estimating a statistical parameter.

Let’s say you like to go out every Saturday and you’ve just bought four new outfits. You want to wear a new outfit every weekend of the month. On the first Saturday, all four outfits are unworn, so you can pick any. The next Saturday you can pick from three and the third Saturday you can pick from two. On the last Saturday of the month though, you’re left with only one outfit and you have to wear it whether you want to or not, whereas on the other Saturdays you had a choice.

So basically, you had 4 − 1 = 3 Saturdays of freedom to choose an outfit; on those three Saturdays your outfit could vary.

That’s the idea behind degrees of freedom.

With respect to numerical values and the mean, the sum of the numerical values must equal the sample size times the mean, i.e. sum = n * mean, where n is the sample size. So if you have a sample size of 20 and a mean of 40, the sum of all the observations in the sample must be 800. The first 19 values can be anything, but the 20th value has to ensure that the total adds up to 800, so it has no freedom to vary. Hence the degrees of freedom are 19.

The formula for degrees of freedom is: degrees of freedom = sample size − number of parameters you're estimating.
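
Here's a minimal numpy sketch of the idea (the numbers are from the example above; the variable names are my own):

import numpy as np

# With n = 20 and a fixed mean of 40, the first 19 values are free to vary
rng = np.random.default_rng(0)
free_values = rng.uniform(20, 60, size=19)

# ...but the 20th value is forced, because the sum must equal n * mean = 800
last_value = 20 * 40 - free_values.sum()

sample = np.append(free_values, last_value)
print(sample.mean())  # 40.0 (up to floating-point rounding): the last value had no freedom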

Comparing means.

If you want to compare the means of two groups then the right tests to choose between are the z-test and the t-test.

One-sample (one-sample z-test or a one-sample t-test): one group will be a sample and the second group will be the population. You're essentially comparing a sample with a standard value from the population, to see whether the sample comes from the population, i.e. does it behave differently from the population or not.

An example of this is the one we discussed in the previous article — the mean age of patients known to visit a dentist is 18, but we hypothesize it could be greater than this. The sample must be randomly selected from the population and the observations must be independent of one another.

Two-sample (two-sample z-test or a two-sample t-test): both groups will be separate samples. As in the case of one-sample tests, both samples must be randomly selected from the population and the observations must be independent of one another.

Two-sample tests are used when there are two variables involved. For example, comparing the mean money spent on a shopping site between the two sexes. One sample will be female customers and the second sample will be male customers. Since the means are being compared, one of the variables involved in the test has to be numerical (the money spent on a shopping site is the numerical variable).

Important note: don’t confuse one-sample and two-sample with one-tailed and two-tailed! The former is related to the number of samples being compared and the latter with whether your alternate hypothesis is directional. You can have a one-sample two-tailed test.

How do we choose between a z-test and a t-test though? By looking at the sample size and population variance.

  • If the population variance is known and the sample size is large (greater than or equal to 30) — we choose a z-test
  • If the population variance is known and the sample size is small (less than 30) — we can perform either a z-test or a t-test
  • If the population variance is not known and the sample size is small — we choose a t-test
  • If the population variance is not known and the sample size is large — we choose a t-test
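Encoded as a tiny, illustrative Python helper (my own summary of the rules above, treating n >= 30 as "large"):

def choose_mean_test(pop_variance_known, n):
    # Decision rules from the list above
    if pop_variance_known:
        return "z-test" if n >= 30 else "z-test or t-test"
    return "t-test"

print(choose_mean_test(pop_variance_known=False, n=25))
>> t-test
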

T-test.

As mentioned above, the t-test is very similar to the z-test, except that it works well with smaller samples and the population variance doesn't need to be known.

The t-test is based on the t-distribution, which is a bell-shaped curve like the normal distribution, but has heavier tails.

As the sample size increases, the degrees of freedom also increase, and the t-distribution becomes similar to the normal distribution: its tails become lighter and it becomes more tightly concentrated around the mean. Why? We'll find out in a bit.

There are three types of t-tests. Two of them have already been introduced above: one-sample and two-sample. Both of these come under the 'unpaired t-test' umbrella, and so the third type of t-test is the 'paired t-test'.

The concept of paired versus unpaired is to do with the samples: are we monitoring a variable in two different groups, or in the same group twice? If the sample is the same, the t-test should be paired; otherwise, unpaired.

For example, let’s say you want to test whether a certain medication increases the level of progesterone in women.

If the data you have is the progesterone levels of a group of women before the medication was consumed and the progesterone levels of the same group of women after the medication was consumed, then you would conduct a paired t-test since the sample is the same.

If the data you have is the progesterone level of two groups of women of different age groups after the medication was consumed, then you would conduct a two-sample unpaired t-test since there are two different samples.

Every statistical test has a test statistic, which helps us calculate the p-value, which in turn determines whether to reject or not reject the null hypothesis. In the case of the t-test, the test statistic is known as the t-statistic. The formula to calculate it differs depending on which t-test you're performing, so let's take a closer look at each.

The code and data used in all the below examples can be found here.

One-sample t-test.

The average height of women in India was recorded to be 158.5cm. Is the average height of women in India today greater than 158.5cm?

To test this hypothesis I asked 25 women their height.

My hypotheses are H₀: μ = 158.5cm (the average height is 158.5cm) and H₁: μ > 158.5cm (the average height is greater). Setting up the test:

  • The significance level is 0.05.
  • The sample mean is 162cm and sample standard deviation is 2.4cm.
  • Since the sample size is 25, the degrees of freedom will be 24 (25 − 1).
  • Since I’m comparing a sample mean with a population mean (standard value), this will be a one-sample test.
  • Since my hypothesis has a direction — the average sample height is greater than the average population height — this will be a one-tailed test.

The formula to calculate the t-statistic is:

t = (x̄ − μ) / (s / √n)

where x̄ is the sample mean, μ is the population mean (the standard value), s is the sample standard deviation and n is the sample size.

So the t-statistic in our case will be:

t = (162 − 158.5) / (2.4 / √25) = 3.5 / 0.48 ≈ 7.29

(Using the rounded summary statistics; the exact data gives 7.36, as we'll see below.)

Next we need to look up the critical value of the t-distribution where alpha is 0.05 and the degrees of freedom are 24 in a t-distribution table. The critical value for our scenario is 1.711. Our t-statistic is greater than the critical value, so we can reject the null hypothesis and conclude that the mean height of women in India is greater than 158.5 cm!
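
If you don't have a table handy, scipy can produce both numbers; a quick sketch using our summary statistics:

from scipy import stats

# One-tailed critical value at alpha = 0.05 with 24 degrees of freedom (~1.711)
stats.t.ppf(0.95, df=24)

# p-value for the manually computed t-statistic: the survival function is 1 - CDF
# (it comes out far below 0.05)
stats.t.sf(7.29, df=24)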

While it is better to calculate the p-value in hypothesis testing to reject or not reject the null hypothesis, the formula to calculate the p-value for a t-statistic is a bit tricky. You can either work with the t-distribution table values or simply use the critical value to reject or not reject the null hypothesis when performing hypothesis testing manually. Otherwise using a calculator or python functions will help you get the p-value. Let’s see how!

We’ll start off by reading our csv into a dataframe:

import pandas as pd
data = pd.read_csv("one-sample-t.csv")
data.head()

We have two columns — age and height. For this one-sample t-test, we only need height since we are comparing the mean height of this sample with the population mean — 158.5cm.

Let’s check the mean and standard deviation of the height column:

data.height.mean()
>> 162.053526834564
data.height.std()
>> 2.4128517286126048

The assumptions of a t-test state that the sample data must come from a normal distribution. We can check whether the height column is normally distributed using a probability plot (also known as a QQ plot, or quantile-quantile plot). In brief, a probability plot is a graphical method to check if a data set follows a particular distribution: it plots two data sets against each other, one being the data whose distribution you want to check, and the other being data from that distribution itself. In our case, one set of data will be the height column and the distribution will be the normal distribution.

from scipy import stats
import pylab

# QQ plot of the height column against a normal distribution
stats.probplot(data.height, dist="norm", plot=pylab)
pylab.show()
[QQ plot: height data points plotted against the normal distribution line]

The red line represents the normal distribution and the blue dots represent our height data. The plot confirms that the height column approximately follows a normal distribution, since the data points track the line closely.
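
If you'd like a more formal check to go alongside the visual one (an extra step, not part of the original workflow), scipy's Shapiro-Wilk test also works here:

# Shapiro-Wilk normality test: a large p-value (> 0.05) means we cannot
# reject the hypothesis that the data comes from a normal distribution
stats.shapiro(data.height)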

Now, we will perform the one-sample t-test using the ttest_1samp function from scipy's stats module. We need to pass it our data and the population mean:

stats.ttest_1samp(data.height,popmean=158.5)
>> Ttest_1sampResult(statistic=7.363748862859639, pvalue=1.32483697812078e-07)

The p-value is ridiculously small! So we can reject the null hypothesis.
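
One caveat worth knowing: ttest_1samp returns a two-sided p-value by default, while our hypothesis is one-tailed. On scipy 1.6.0 and later you can make the direction explicit with the alternative parameter (since our t-statistic is positive, this simply halves the p-value):

stats.ttest_1samp(data.height, popmean=158.5, alternative='greater')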

Two-sample t-test.

Is there a relationship between age and height of women in India?

To test this hypothesis I asked 50 women their age and height: 25 women are between 27 and 30 years of age (group A), and 25 women are between 37 and 40 years of age (group B).

My hypotheses are H₀: μ_A = μ_B (the mean heights of the two groups are equal) and H₁: μ_A ≠ μ_B (they differ). Setting up the test:
  • The significance level is 0.05.
  • The sample mean and standard deviation for group A are 162cm and 2.4cm respectively.
  • The sample mean and standard deviation for group B are 158.6cm and 3.4cm respectively.
  • Since I’m comparing the means of two samples, this will be a two-sample test.
  • Since my hypothesis is nondirectional, this will be a two-tailed test.

It was mentioned earlier that parametric tests assume homogeneity of variance, i.e. the variance of both the samples should be the same. In the example mentioned here, the variance is definitely not the same — the standard deviation of group A is 2.4cm whereas it’s 3.4cm for group B. Does this mean we can’t perform a two-sample t-test? No, it doesn’t! Thankfully, there’s a variation of the t-test that allows for different variances and it’s called Welch’s t-test.

When the variance of both samples is equal, the denominator used in calculating the t-statistic is based on what's known as the pooled variance. If the sample sizes of the two groups are different, the formula is:

s_p² = ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2)

If the sample sizes of both groups are equal then the formula is simply:

s_p² = (s₁² + s₂²) / 2

It gives the common variance of the two groups to be used in the t-statistic formula. The formula for Student's t-statistic is:

t = (x̄₁ − x̄₂) / (s_p × √(1/n₁ + 1/n₂))

However, when the variances of the two samples are not equal, each sample's variance enters the denominator separately, and the formula to calculate the t-statistic becomes:

t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

Furthermore, the calculation of the degrees of freedom also differs between the two tests. If the variances of both groups in the current example were equal, the degrees of freedom would be 48 (25 + 25 − 2; we subtract 2 because we are estimating two parameters, the mean of each sample).

In the case of Welch’s t-test, the degrees of freedom are fractional, always smaller than the degrees of freedom of Student’s t-test, and frankly a bit complicated to calculate.

Since our variances are not equal, we will be performing Welch’s t-test.

So the t-statistic in our case will be:

t = (162 − 158.6) / √(2.4²/25 + 3.4²/25) = 3.4 / √0.6928 ≈ 4.08

(Using the rounded summary statistics; the exact data gives 4.11, as we'll see below.)

Let’s do this in python too.

import pandas as pd
df_a = pd.read_csv("one-sample-t.csv") # group A
df_b = pd.read_csv("two-sample-t.csv") # group B

Group A is the same csv we used for the one-sample t-test, so we already know its mean and standard deviation. Let’s check the same for Group B.

df_b.height.mean()
>> 158.60704061997612
df_b.height.std()
>> 3.42443022417948
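
Before running the test, we can formally check the homogeneity of variance assumption with Levene's test (an optional extra step, not part of the original walkthrough):

# A small p-value (< 0.05) suggests unequal variances, pointing us to Welch's t-test
stats.levene(df_a.height, df_b.height)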

Now we perform the t-test! To perform Welch's t-test we simply need to pass the equal_var parameter as False. By default it is True, so if we were performing Student's t-test we needn't pass it at all.

stats.ttest_ind(df_a.height, df_b.height, equal_var=False)
>> Ttest_indResult(statistic=4.113633648976651, pvalue=0.00017195968508873518)

The p-value is much smaller than 0.05 hence we can reject our null hypothesis.

Paired t-test.

Does nutritional drink xyz increase the height of women?

To test this hypothesis I measured the height of 25 women before they began the course of the nutritional drink and then after they completed the course.

My hypotheses are H₀: the mean height is the same before and after the drink, and H₁: the mean height after the drink is greater than before. Setting up the test:
  • The significance level is 0.05.
  • The sample mean and standard deviation for the women before the drink are 162cm and 2.4cm respectively.
  • The sample mean and standard deviation for the women after the drink are 167cm and 3.4cm respectively.
  • Since the sample size is 25, the degrees of freedom will be 24 (25 − 1).
  • Since I’m comparing the means of the same sample but with an intervention in between, this will be a paired t-test.
  • Since my hypothesis is directional, this will be a one-tailed test.

Fun fact — a paired t-test calculates the differences between the paired observations in the two sets of data (same sample, before and after) and then performs a one-sample t-test on those differences, using their mean and standard deviation, against a population mean of 0.

Let’s implement this directly in python:

Read the csv into a dataframe.

import pandas as pd
data = pd.read_csv("paired-t.csv")
data.head()

Use the describe() method to check the mean and standard deviations of both the before and after columns.

data.describe()
[describe() output: mean and standard deviation of the height_before and height_after columns]

Perform the paired t-test using scipy's stats.ttest_rel method! Since our alternate hypothesis is that the mean height after the nutritional drink will be greater than the height before it, we pass the 'after' column first and set the alternative parameter to 'greater' (scipy tests whether the mean of the first sample minus the second is greater than zero):

stats.ttest_rel(data.height_after, data.height_before, alternative='greater')
>> Ttest_relResult(statistic=1.9094338173992416, pvalue=0.0341155471994887)

The p-value is less than 0.05, hence we can reject our null hypothesis and conclude that the drink does appear to increase height. (Note that the argument order matters here: passing the 'before' column first with alternative='greater' would test the opposite direction and return a p-value of about 0.97.)
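
As a sanity check on the fun fact above, a one-sample t-test on the paired differences gives exactly the same result:

# Differences (after - before), tested against a population mean of 0
diff = data.height_after - data.height_before
stats.ttest_1samp(diff, popmean=0, alternative='greater')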

Now that we’ve seen all the types of t-tests and their formulae for calculating the t-statistic, we can understand why as the sample size increases the t-distribution becomes similar to the normal distribution. All the different t-tests involve the sample’s standard deviation / variance. This is simply an estimation of the population’s variance since that is unknown. Since the assumption in a t-test is that the sample data is from a population which follows the normal distribution, as the sample size increases and hence the degrees of freedom increase, there is a higher chance of this estimation of variance to actually be correct, i.e. to be the population’s variance. Additionally, the larger the sample size, the closer it is to being the population. Since the population is normally distributed, it makes sense that the t-distribution for this larger sample size with a higher number of degrees of freedom also resembles a normal distribution.


A python developer working on AI and ML, with a background in Computer Science and Psychology. Interested in healthcare AI, specifically mental health!