I came to write this article through what was a predictable yet still unexpected set of events. I recently finished a course on statistical testing and reporting, and I set out to write a series of articles explaining the details of the most useful statistical tests I learned. I wished to do this both to cement my own knowledge as well as help other data scientists learn a topic I found immensely helpful.
The first of these articles was going to be on the t-test, a common statistical test used to determine if two means (averages) from different sets of data are statistically different. I began to write this article, but I realized I needed to first explain that there are two different kinds of t-tests. Then, I realized that to explain that, I needed to explain a separate but related underlying concept. The cycle continued as I planned out the article.
Furthermore, I realized that I would need to do this with each new article I wrote, as every statistical test required the same underlying knowledge base. Rather than repeat this information in each article, it would be much better to reference one standing source of information.
And thus, this article was born. In the words that follow, I will attempt to give a concise but effective primer on the basic concepts you should be familiar with in order to conduct and report statistical tests. For your convenience, I have broken down the concepts in the order you would encounter them running a study from start to finish. So without further ado, let’s get into it.
Quantitative Study Design
When designing a study, there are several important details one needs to consider. This article is not about study design, and I won’t be going into the details of best practices and the reasoning behind them. That said, the design of a study strongly influences the eventual statistical test needed, and so it is essential to have a basic understanding of the following concepts:
- Factors and measures
- Levels and treatments
- Between vs. Within subjects
Factors and Measures
While you may not have heard the terms "factor" and "measure" before, it is likely you encountered them back in high school science class under different names: "independent variable" and "dependent variable," respectively.
In a scientific experiment, a factor is a variable/condition that you actively manipulate or change in order to observe its effect on a different variable. The variable upon which you are observing an effect is the measure.
This is easier to see with an example. Let’s imagine we are conducting a fun experiment intended to determine if the type of meat a person consumes upon waking up can influence their 100-meter dash time later in the day. We have two groups of participants: Everyone in the first group receives a chicken breast, and everyone in the second group receives a steak. In the afternoon, the members of each group run a 100-meter dash and the respective times are recorded.
In this experiment, the factor is the type of meat, because that is what we are actively changing, and the measure is the 100-meter dash time, because that is the variable upon which we are attempting to observe some effect.
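In code, this distinction often shows up as which column you group by (the factor) and which column you aggregate (the measure). Here is a minimal sketch of how the meat experiment's data might be organized, assuming pandas is available; the numbers are made up for illustration.

```python
import pandas as pd

# Illustrative (made-up) data: 'meat' is the factor (what we manipulate),
# 'dash_time' is the measure (what we observe), in seconds.
df = pd.DataFrame({
    "participant": [1, 2, 3, 4],
    "meat": ["chicken", "chicken", "steak", "steak"],
    "dash_time": [13.9, 14.4, 14.1, 13.7],
})

# Group by the factor and summarize the measure for each treatment.
print(df.groupby("meat")["dash_time"].mean())
```

The analysis always runs in this direction: slice the measure by the levels of the factor, never the other way around.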
Levels and Treatments
These two terms are related to the factor in an experiment. The levels of a factor refer to the number of differing conditions it has within the study. The actual value or manifestation of the factor at each level is a treatment.
For example, in the experiment above, there are two levels, because we are testing out two different types of meat. The two treatments are chicken and beef. Were we to throw duck into the fray, then we would have three levels of the factor, with the third treatment being duck meat.
Between-Subjects and Within-Subjects Design
These last two are slightly more confusing, but incredibly important – whether a study uses a between-subjects or within-subjects design directly impacts the type of statistical test one can use for analysis.
Fundamentally, this aspect of study design has to do with how participants are split up across the different treatments of the factor(s) in a study.
In a between-subjects design, every participant is exposed to only one treatment, and in a within-subjects design, every participant is exposed to all the treatments. Said another way, a between-subjects design uses different sets of participants for each level of the independent variable, whereas a within-subjects design uses the same set of participants repeatedly.
For instance, consider a study in which we want to see if a new type of contact lens enables better performance on a vision test. We could give one group of participants the initial lens and another group the new lens, and compare their respective performances on the vision test (between-subjects design). Alternatively, we could have the same group of participants try out both lenses and compare the performances on the vision test for the same participants with different lenses (within-subjects design).
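The design choice maps directly onto the test you run later. A sketch of that mapping for the lens study, assuming scipy is available and using randomly generated stand-in scores (not real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Between-subjects: two DIFFERENT groups of participants, one per lens.
# Independent groups call for an independent-samples t-test.
group_old_lens = rng.normal(70, 5, size=30)  # made-up vision-test scores
group_new_lens = rng.normal(73, 5, size=30)
t_between, p_between = stats.ttest_ind(group_old_lens, group_new_lens)

# Within-subjects: the SAME participants try both lenses, so each row in
# the two arrays is one person measured twice. That pairing calls for a
# paired-samples t-test.
with_old_lens = rng.normal(70, 5, size=30)
with_new_lens = with_old_lens + rng.normal(3, 2, size=30)
t_within, p_within = stats.ttest_rel(with_old_lens, with_new_lens)
```

Note that the paired test exploits the per-participant pairing, which is exactly the extra information a within-subjects design buys you.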
Note that a within-subjects design is not always possible. In the meat and running example above, assuming the experiment must be done in a single day (which may well be the case due to resource restrictions), a single person can only have one type of meat for breakfast, not both.
Finally, experiments with multiple factors can incorporate both between-subjects and within-subjects elements. Such an approach is known as a split-plot design. For example, say we want to evaluate performance on a mental health evaluation, and we have two factors: 1) year in college and 2) amount of daily screen time. We decide to conduct this experiment over the course of a year, giving the participants in each year (freshman, sophomore, etc.) no screen time restrictions for the first six months, and a 30-minute daily screen time restriction for the final six months. The mental health evaluation is given at the end of each six-month period.
In this experiment, the screen time is tested in a within-subjects manner (the same participants undergo both treatments), but the year in college is tested in a between-subjects manner (an individual cannot be in two years simultaneously). Note that this experiment is not intended as a model to follow (meticulous readers will notice that many confounding factors are possible), but rather as a simplified example to explain how a split-plot design might look.
With that, let us move forward.
Significance Testing
If you’ve even tangentially dealt with statistical tests before, it’s likely you’ve heard the phrase "statistically significant difference." Much of modern statistical testing (within the frequentist paradigm, at least, but we’ll leave that aside for now) lies in trying to determine if there is some meaningful difference among the different treatment groups in an experiment.
The terms in this section are all essential for understanding this idea. We’ll go through these a bit differently than above. First, I will define all the terms. Then, since they are all interrelated within an individual experiment, we’ll go through a single hypothetical experiment, emphasizing the role of each of these terms.
First things first: hypothesis testing. In a traditional statistical experiment, we begin with two hypotheses:
- Null Hypothesis: This hypothesis states that there is no statistically significant difference among the treatment groups.
- Alternative Hypothesis: This hypothesis states that there is a statistically significant difference among the treatment groups. It can be one-sided (hypothesizing a difference in a particular direction, i.e., greater than or less than), or it can be two-sided (simply hypothesizing a difference in either direction).
In all statistical tests, we start by assuming that the null hypothesis is true. Then, with that assumption, we calculate the likelihood of seeing our actual data. If the likelihood is very low (below a certain threshold – see below), then we determine that the null hypothesis must in fact be false, and we reject it.
This calculated probability is known as the p-value. Formally, the p-value is the probability of observing data at least as extreme as ours, assuming the null hypothesis is true. The threshold it is compared against is called the significance level (generally .05, though this can vary among fields and experiments). If the p-value falls below the significance level, we reject the null hypothesis, claiming a statistically significant difference in our results. This makes logical sense, as a low p-value indicates that our data would be very unlikely if the null hypothesis were true.
This is sufficient to get you started – if you’re interested in learning more details, I recommend this primer specifically on p-values [2].
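The full loop – state the hypotheses, compute the p-value, compare it to the significance level – can be sketched in a few lines. This assumes scipy and reuses the meat-and-dash example with randomly generated stand-in times; here the two groups are drawn from the same distribution, so the null hypothesis is true by construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Made-up 100-meter dash times (seconds) for the two breakfast groups.
# Both are drawn from the same distribution, so any observed difference
# is due to random chance alone.
chicken_group = rng.normal(14.0, 0.8, size=25)
steak_group = rng.normal(14.0, 0.8, size=25)

alpha = 0.05  # the significance level (our rejection threshold)

# Null hypothesis: the mean dash times are equal.
# Two-sided alternative: the mean dash times differ.
t_stat, p_value = stats.ttest_ind(chicken_group, steak_group)

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject the null hypothesis")
```

Note the careful wording in the second branch: we "fail to reject" the null hypothesis rather than "accept" it, since a high p-value is an absence of evidence, not evidence of absence.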
Two Classes of Tests
Finally, when you are dealing with statistical tests, you need to know whether you should use a parametric or a nonparametric test.
Parametric tests are the more widely known type of statistical test, largely because the most popular tests tend to be parametric. These tests come with a set of requirements on various statistical parameters of the data. For example, all parametric tests require that the data come from a random sample. Other requirements vary from test to test, such as requiring that the data follow a specific type of distribution.
Unfortunately, these requirements are not always met when dealing with data in the wild. Some introductory classes teach students to use the test anyway for simplicity’s sake, only briefly mentioning that alternative techniques exist beyond the scope of the class.
However, just using the test anyway is not appropriate in a real-world context where the data do not conform to the necessary requirements. Nonparametric tests were designed precisely for this reason. These are statistical tests that make minimal assumptions about the data, and thus should be used in situations when the data does not behave [3].
For nearly every parametric statistical test, there is a corresponding nonparametric test. Thus, once all the elements of an experiment mentioned above (number of factors, treatments for each factor, etc.) have been taken into account, the final determination of what test one should use concerns whether a parametric or nonparametric test should be used.
At this point, it is natural to wonder why one might use parametric tests at all. While a detailed discussion of this is beyond the scope of this article, the high-level reason is simply that parametric tests provide more statistical power, and so they should be used whenever possible.
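To make the parametric/nonparametric pairing concrete, here is a sketch (assuming scipy, with made-up data) of a case where a nonparametric test is the safer choice: the two groups are heavily skewed, violating the normality assumption that the t-test's accuracy relies on for small samples. The Mann-Whitney U test is the standard nonparametric counterpart of the independent-samples t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Made-up, heavily right-skewed data (e.g., task completion times).
# Exponential data like this violates the t-test's normality assumption.
group_a = rng.exponential(scale=1.0, size=40)
group_b = rng.exponential(scale=1.5, size=40)

# Parametric option: independent-samples t-test (assumes normality).
t_stat, p_param = stats.ttest_ind(group_a, group_b)

# Nonparametric counterpart: Mann-Whitney U test, which compares ranks
# and makes no normality assumption.
u_stat, p_nonparam = stats.mannwhitneyu(group_a, group_b)
```

The trade-off mentioned above applies here: when the data really are normal, the t-test will detect a true difference with fewer samples; when they are not, the Mann-Whitney U test's conclusions remain trustworthy.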
Recap and Final Thoughts
Here is a quick review of the foundational concepts you should understand if you’re looking to learn statistical testing:
- Quantitative study design. Understand what makes up an experiment, including factors, measures, treatments, and different participant designs (between subjects and within subjects).
- Significance Testing. Understand how to formulate the null hypothesis and the alternative hypothesis, and how to use the p-value.
- Types of Statistical Tests. Understand when to use a parametric test vs. a nonparametric one.
When you come to the analysis phase of a study, having all the elements above documented clearly with respect to your experiment is extremely helpful. The statistical test you need to use will be directly related to them. That said, it is always good to understand concepts before applying them, and I hope this article has assisted you in that goal.
Happy testing!
References
[1] Lazar, J., Feng, J.H. and Hochheiser, H. (2017). Research Methods in Human-Computer Interaction (2nd ed.). Cambridge, MA.
[2] https://towardsdatascience.com/how-to-understand-p-value-in-layman-terms-80a5cc206ec2
[3] Vaughan, L. (2001). Statistical Methods for the Information Professional. Medford, NJ: ASIS&T Press, pp. 139–155.