The world’s leading publication for data science, AI, and ML professionals.

A Closer Look at the Performance of the T-test

A simulation study investigating type I error and power of the t-test under different scenarios. How sample sizes and variances affect the…

Photo by gerald on Pixabay

How can we assess how well a statistical test performs? The answer is Monte Carlo simulation. A simulation study is a computer experiment that involves generating data from probability distributions using pseudo-random sampling. The key strength of a simulation is that we can evaluate the behavior of a statistical method since we can control both the test’s assumptions and the "truth". With real-world data, the parameters of the populations are unknown. The simulation is fairly straightforward once we have specified our assumptions and parameters to be used in the data generation. In this article, I will guide you through a small simulation study using R to investigate the performance of two versions of the t-test.

The code used in this article can be found in this GitHub repo.


Evaluating the performance of the T-test

The two-sample independent t-test is a common statistical test to see if the means of two groups differ. For example, it can be used to test if the effect of a drug differs between the treatment group and the placebo group. The t-test, like any other statistical method, has assumptions. What happens if the assumptions are violated? One of the assumptions of the "regular" t-test, also known as the Student’s t-test, is that the variances of the two groups being compared are equal. There is an adapted t-test called the Welch t-test, which can be more reliable if the variances cannot be assumed to be equal.

Our simulation study will investigate how the performance of the two t-tests is affected by the variances and sample sizes of the two independent samples. Our study targets the null hypothesis, and we will look at type I error and power to evaluate the performance. These two concepts are explained in more detail further down in the article.

Questions to be answered by the simulations

We will look at small and large sample properties as well as effects from equal and unequal variances. More precisely, we will look at the following scenarios:

  1. How group sample sizes affect type I error when variances are equal
  2. How group sample sizes affect type I error when variances are not equal
  3. How the differences in means affect power when variances are equal and not equal respectively

First, let us review the assumptions and the formulas for the t-tests.

Assumptions of the t-test

  1. The data is continuous.
  2. The data follows the normal distribution.
  3. The two samples are independent.
  4. The two samples are simple random samples from their respective population.
  5. The variances of the two populations are equal (Student’s t-test only; the Welch t-test does not assume equal variances).

All assumptions for the Student’s t-test and the Welch test are the same except for the last one, regarding variances. In our simulation, we will generate data that satisfies the first four assumptions.

Student’s t-test

The t-statistic is calculated as

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

where the pooled variance is

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

with degrees of freedom

$$df = n_1 + n_2 - 2$$

Welch t-test

The Welch test does not use the pooled variance; instead, it uses the variance of each sample directly. The t-statistic is calculated as

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

with degrees of freedom given by the Welch–Satterthwaite equation

$$\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Type I error and power

The target of our investigation is the null hypothesis, and the performance measurements we will look at to evaluate the tests are type I error and power. Both versions of the t-test evaluate the null hypothesis that the means are the same, against the alternative hypothesis that there is a difference in means.

To assess how the t-tests perform under different scenarios we will look at the rejection rate of the null hypothesis when the null is true and when the null is false. That is, the percentage of p-values smaller than nominal size alpha = 0.05. This will give us the type I error and power respectively. The matrix below gives an overview of the possible outcomes of a statistical test given the truth/falseness of the null hypothesis.

Type I and type II error matrix (Image by Author)

Size, or type I error, is the probability of rejecting a true null hypothesis in favor of a false alternative (a false positive). In other words, it is the probability of erroneously detecting a difference in means when the true means are equal. We can estimate a test’s type I error by generating data for two independent samples with the same mean; the rejection rate across the simulated datasets estimates the type I error. A test has significance level alpha if its size is at most alpha. Often the size and the significance level coincide, and a common choice of alpha is 0.05. A size of 0.05 means that, on average, the test will produce a false positive in about 5% of cases purely due to randomness.

Power is the probability of rejecting a false null hypothesis (a true positive); it gives the probability of detecting a true difference in means. As with type I error, it can be estimated by generating two independent samples, in this case with different means, running the t-tests, and calculating the rejection rates. Ideally, we want the power of a test to be as high as possible, and at least higher than 80%. A power of 80% means the probability of detecting a true difference in means is 80%.
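As a sanity check on simulated power, the exact power of a two-sided, two-sample Student’s t-test under normality can be computed from the noncentral t distribution. A minimal sketch (the parameter values here are illustrative, not from the article):

```python
import numpy as np
from scipy import stats

# Exact power of a two-sided, two-sample Student's t-test under normality.
# Illustrative parameters: standardized effect size d = 1, n = 20 per group.
n, d, alpha = 20, 1.0, 0.05
df = 2 * n - 2
ncp = d * np.sqrt(n / 2)                  # noncentrality parameter
t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
power = (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
print(round(power, 3))
```

A rejection rate simulated with the same parameters should land close to this value, up to Monte Carlo error.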

Code to run the simulations

We want to create a function that runs a simulation and that allows us to specify the scenario we want to investigate by specifying the parameters in its arguments.

The function below takes formal arguments seed (for reproducibility), S for the number of iterations, and for both samples, we can specify sample size, mean and standard deviation. The function generates data from the normal distribution with the given parameter values and performs the Student’s t-test and the Welch’s test. The p-values for each dataset and each test are stored and the proportions of rejections of the null hypothesis are calculated and returned. The proportion is the size or power (depending on whether the means are passed as equal or different).
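The article’s implementation is in R (available in the linked repo). Below is a rough Python sketch of the same procedure using `scipy.stats.ttest_ind`, whose `equal_var` argument switches between the Student and Welch versions; the function name and argument names are my own:

```python
import numpy as np
from scipy import stats

def t_test_sim(seed, S, n1, mean1, sd1, n2, mean2, sd2, alpha=0.05):
    """Simulate S datasets and return the rejection rate of the null
    hypothesis for Student's t-test and the Welch t-test."""
    rng = np.random.default_rng(seed)
    p_student = np.empty(S)
    p_welch = np.empty(S)
    for i in range(S):
        x = rng.normal(mean1, sd1, n1)
        y = rng.normal(mean2, sd2, n2)
        # equal_var=True  -> Student's t-test (pooled variance)
        # equal_var=False -> Welch t-test (Welch-Satterthwaite df)
        p_student[i] = stats.ttest_ind(x, y, equal_var=True).pvalue
        p_welch[i] = stats.ttest_ind(x, y, equal_var=False).pvalue
    # Proportion of p-values below alpha: size if the means are equal,
    # power if they differ.
    return {"student": float(np.mean(p_student < alpha)),
            "welch": float(np.mean(p_welch < alpha))}
```

Calling it with equal means estimates the type I error; calling it with different means estimates power.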


Scenario 1 (type I error): Equal variances, small and large samples

Once we have a function to simulate different scenarios we can run our first simulation where we evaluate type I error (false positives), by specifying the means for both groups to be equal. This simulation does not violate any assumption of either version of the t-test. The values in the simulation are specified as below.

The code in R to run the simulation:
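A Python sketch of this run follows (the article’s own code is in R, in the linked repo; the exact parameter values here are assumptions, with both groups drawn from N(0, 1) and group sizes spanning 3 to 200 to cover the small and large samples discussed below):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)
S = 2000        # simulated datasets per sample size (assumed)
alpha = 0.05
results = {}
for n in [3, 10, 30, 100, 200]:          # assumed grid of group sizes
    p_s, p_w = np.empty(S), np.empty(S)
    for i in range(S):
        x = rng.normal(0, 1, n)          # equal means, equal variances
        y = rng.normal(0, 1, n)
        p_s[i] = stats.ttest_ind(x, y, equal_var=True).pvalue
        p_w[i] = stats.ttest_ind(x, y, equal_var=False).pvalue
    results[n] = {"student": np.mean(p_s < alpha),
                  "welch": np.mean(p_w < alpha)}
for n, r in results.items():
    print(n, r)
```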

This table shows the results and we can see that, when variances are equal, the Student’s t-test performs well with alpha values at the nominal level for both small and large samples. The Welch test performs well on large samples but poorly on small samples.

Image by Author
Image by Author

Scenario 2 (type I error): Unequal variances, small and large samples

In the second simulation, we look at type I error (false positives) when variances are unequal. This violates the assumption of the Student’s t-test, so we expect to see some alpha values that differ from the target level of 0.05. The Welch test does not assume equal variances so let us see if this test performs better than the Student’s t-test. We set the values in the simulation as follows.

The code in R to run the simulation:
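A Python sketch of this run follows (again, the article’s code is in R; the standard deviations of 1 and 2 and the grid of group sizes are assumed values for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)
S = 2000        # simulated datasets per sample size (assumed)
alpha = 0.05
results = {}
for n in [3, 10, 30, 100, 200]:          # assumed grid of group sizes
    p_s, p_w = np.empty(S), np.empty(S)
    for i in range(S):
        x = rng.normal(0, 1, n)          # sd = 1
        y = rng.normal(0, 2, n)          # sd = 2: same mean, unequal variances
        p_s[i] = stats.ttest_ind(x, y, equal_var=True).pvalue
        p_w[i] = stats.ttest_ind(x, y, equal_var=False).pvalue
    results[n] = {"student": np.mean(p_s < alpha),
                  "welch": np.mean(p_w < alpha)}
for n, r in results.items():
    print(n, r)
```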

As we expected, when variances are unequal, the alpha values for the Student’s t-test deviate from the nominal level. The alpha value is on target only for the largest sample size of 200, and the deviation grows as the sample size shrinks. The Welch test performs well, with alpha values at the 5% target for all sample sizes except the smallest, 3.

Image by Author

A side note on scenarios where the two group sample sizes differ: I also ran simulations like scenarios 1 and 2, but with unequal sample sizes. When variances are equal, the Student’s t-test performs the same whether the sample sizes are equal or not. We saw that the Student’s t-test is sensitive to unequal variances, and adding unequal sample sizes makes the test perform even worse; it performed poorly even for large sample sizes above 200. The Welch test is not sensitive to unequal variances and remains robust when sample sizes are unequal as well. The Welch test does not, however, perform well on small sample sizes, regardless of whether variances or sample sizes are equal.

Scenario 3 and 4 (power): Equal and unequal variances

In the third and fourth scenarios, we compare the power (true positives) of the two t-tests for different values of the mean in the second sample group, holding all other parameters equal. In these simulations, we will see what the probability is of detecting the actual difference in means. In the first run, we specify the variances of the two groups to be equal and in the second simulation, we specify the variances to be non-equal. The values for the simulations are as follows.

Code in R for the simulations evaluating power:
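A Python sketch of the power runs follows (the article’s code is in R; the grid of mean differences and the standard deviations are assumed values, and the group size of 20 is taken from the discussion at the end of the article):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)
S = 2000                                  # simulated datasets per setting (assumed)
alpha = 0.05
n = 20                                    # group size
power = {}
for sd2 in [1, 2]:                        # equal, then unequal, variances (assumed)
    for diff in [0.2, 0.4, 0.6, 0.8, 1.0]:    # assumed grid of mean differences
        p_s, p_w = np.empty(S), np.empty(S)
        for i in range(S):
            x = rng.normal(0, 1, n)
            y = rng.normal(diff, sd2, n)  # second group shifted by diff
            p_s[i] = stats.ttest_ind(x, y, equal_var=True).pvalue
            p_w[i] = stats.ttest_ind(x, y, equal_var=False).pvalue
        power[(sd2, diff)] = {"student": np.mean(p_s < alpha),
                              "welch": np.mean(p_w < alpha)}
for key, r in power.items():
    print(key, r)
```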

The plot below shows the power of the two t-tests: bright red and blue show power when variances are equal, and dark red and blue show power when variances are unequal. The Student’s t-test and the Welch test have almost identical power, both when variances are equal and when they are unequal, although the power is lower for both tests when variances are unequal.

Image by Author

This means that if we have two samples with unequal variances, the choice between the regular t-test and the Welch test does not affect the test’s power. However, since we have seen that the type I error is higher for the regular t-test than for the Welch test when variances are unequal, the Welch test is to be preferred. Keep in mind, though, that unequal variances may lead to lower power. (Power also depends on other factors, such as the alpha level, sample size, and effect size.)


Conclusions

A simulation study is very effective for evaluating statistical methods. In this article, we ran a simulation to evaluate the type I error and power of two versions of the two-sample t-test. The simulations showed that the Student’s t-test, which assumes equal variances in both groups, is very robust for both small and large sample sizes when the variances are in fact equal. It does not perform well when the assumption of equal variances is violated.

The Welch test does not assume equal variances, and the simulations show that it is more robust than the Student’s t-test when variances and sample sizes are unequal. It does not perform well for small sample sizes, even when variances and sample sizes are equal, but it performed well for large sample sizes.

Both versions of the test give almost identical power. An interesting observation from the simulation is how low the power is for small differences in means when the sample size is only 20, especially when variances are unequal.



Join Medium with my referral link – Andrea Gustafsen

