Doing and reporting your first ANOVA and ANCOVA in R

How to test and report the impact of a categorical independent variable on an interval dependent variable.

Matthieu Renard
Towards Data Science

--

Analysis of Variance, or ANOVA, is a frequently-used and fundamental statistical test in many sciences. In its most common form, it analyzes how much of the variance of the dependent variable can be attributed to the independent variable(s) in the model. It is most often used to analyze the impact of a categorical independent variable (e.g., experimental conditions, dog breeds, flower species, etc.) on an interval dependent variable. At its core, an ANOVA will provide much of the same information provided in a simple linear regression (i.e., OLS). However, an ANOVA can be seen as an alternative interface through which this information can be accessed. Different scientific domains might have different preferences, which generally imply which test you should be using.

An Analysis of Covariance, or ANCOVA, denotes an ANOVA with more than one independent variable. Imagine you would like to analyze the impact of the dog’s breed on the dog’s weight, controlling for the dog’s age. Without controlling for the dog’s age, you might never be able to identify the true impact of the dog’s breed on its weight. You, therefore, need to run an ANCOVA to ‘filter out’ the effect of the dog’s age to see if the dog’s breed still influences the weight. Controlling for another covariate might strengthen or weaken the impact of your independent variable of interest.

Photo by Akshay Nanavati on Unsplash

ANOVA

The dataset

For this exercise, I will use the iris dataset, which is available in core R and which we will load into the working environment under the name df using the following command:

df = iris

The iris dataset contains variables describing the shape and size of different species of Iris flowers.

A typical hypothesis that one could test using an ANOVA could be if the species of the Iris (the independent categorical variable) has any impact on other features of the flower. In our case, we are going to test whether the species of the Iris has any impact on the petal length (the dependent interval variable).

Ensuring you don’t violate key assumptions

Before running the ANOVA, you must first confirm that a key assumption of the ANOVA is met in your dataset. Key assumptions are aspects, which are assumed in how your computer calculates your ANOVA results — if they are violated, your analysis might yield spurious results.

For an ANOVA, the assumption is the homogeneity of variance. This sounds complicated, but it basically checks that the variances in different groups created by the categorical independent variable are equal (i.e., the difference between the variances is zero). And we can test for the homogeneity of variance by running Levene’s test. Levene’s test is not available in base R, so we will use the car package for it.

Install the package.

install.packages("car")

Then load the package.

library(car)

Then run Levene’s test.

leveneTest(Petal.Length~Species,df)

This yields the following output:

As you can see, the test returned a significant outcome. Here it is important to know the hypotheses built into the test: Levene’s test’s null hypothesis, which we would accept if the test came back insignificantly, implies that the variance is homogenous, and we can proceed with our ANOVA. However, the test did come back significantly, which means that the variances between Petal.Length of the different species are significantly different.

What now?

Well… talk to your co-authors, colleagues, or supervisors at this point. Technically, you would have to do a robust ANOVA, which provides reliable results even in the face of inhomogeneous variances. However, not all academic disciplines follow this technical guidance… so talk to more senior colleagues in your field.

Anyhow, we will continue this tutorial as if Levene’s test came back insignificant.

Running the actual ANOVA

We do this by specifying this model using the formula notation, the name of the data set, and the Anova command:

fit = aov(Petal.Length ~ Species, df)

In the command above, you can see that we tell R that we want to know if Species impacts Petal.Length in the dataset df using the aov command (which is the ANOVA command in R) and saving the result into the object fit. The two essential things about the above command are the syntax (i.e., the structure, the ~ symbol, the brackets, etc.) and the aov command. Everything else can be modified to fit your data: Petal.Length and Species are names specified by the iris dataset, and df and fit are just names I arbitrarily chose — they could be anything you would want to analyze.

As you might have noticed, R didn’t report any results yet. We need to tell R that we want to access the information saved into the object called fit using the following command:

summary(fit)

This command yields the following output:

This table gives you a lot of information. Still, the key parts we are interested in are the row Species, as this contains the information for the independent variable we specified and the columns F-value and Pr(>F). If our goal is to reject the null hypothesis (in this case, the null hypothesis is that the species of the Iris don’t have any impact on the petal length) and to accept our actual hypothesis (that the species do have an impact on the petal length), we are looking for high F-values and low p-values. In our case, the F-value is 1180 (which is very high), and the p-value is smaller than 0.0000000000000002 (2e-16 written out, and which is — you might have guessed it — very low). This finding supports our hypothesis that the species of the Iris has an impact on the petal length.

Reporting the results of the ANOVA

If we wanted to report this finding, it is good practice to report the means of the individual groups in the data (species in our case). We do this using the describeBy command from the psych package. Use the following command if you haven’t installed the psych package and want to use it for the first time:

install.packages("psych")

Otherwise, or after installing the psych package, run the following commands.

library(psych)
describeBy(df$Petal.Length, df$Species)

For the describeBy function, you communicate the variable you want to see described (Petal.Length) and the grouping variable (Species). We need to specify df in front of the variable names as — differently to the formula notation used by the aov command used above — the describeBy command doesn’t allow us to specify the dataset separately. Running this command yields the following output:

In this output, we can see the three species Setosa, Versicolor, and Virginica, and in the 3rd column, we are presented with the mean of the values for Petal.Length for the three groups.

This finding could be reported in the following way:

We observed differences in petal lengths between the three species of Iris Setosa (M=1.46), Versicolor (M=4.26), and Virginica (M=5.55). An ANOVA showed that these differences between species were significant, i.e. there was a significant effect of the species on the petal length of the flower, F(2,147)=1180, p<.001.

One could also add a graph illustrating the differences using the package ggplot2. Run the below command to install the ggplot2 package if you haven’t already installed it.

install.packages("ggplot2")

Then load the package.

library(ggplot2)

And then run the command for the graph.

ggplot(df,aes(y=Petal.Length, x=Species, fill=Species))+
stat_summary(fun.y="mean", geom="bar",position="dodge")+
stat_summary(fun.data = mean_se, geom = "errorbar", position="dodge",width=.8)

This produces the graph below. The code is rather complex, and explaining the syntax of ggplot2 goes beyond this article's scope, but try to adapt it and use it for your purposes.

ANCOVA

Now imagine you wanted to do the above analysis but while controlling for other features of the flower's size. After all, it might be that the species does not influence petal.length specifically, but more generally, the species influences the overall size of the plant. So the question is: Controlling for other plant size measures, does species still influence the length of the petal? In our analysis, the other metric that represents plant size will be the variable Sepal.Length, which is also available in the iris dataset. So, we specify our extended model just by adding this new covariate. This is the ANCOVA — we are analyzing the impact of a categorical independent variable on an interval dependent variable while controlling for one or more covariates.

fit2=aov(Petal.Length~Species+Sepal.Length,df)

However and unlike before, we cannot simply run the summary command on the fit2 object now. Because by default and very strangely, base R uses type I errors as default. Type I errors are not a problem when performing a simple ANOVA. However, if we are trying to run an ANCOVA, type I errors will lead to wrong results, and we need to use type III errors. If you are interested in what Type I vs. Type III errors are, I can recommend the Jane Superbrain section at the bottom of page 457 in Andy Field’s book “Discovering Statistics Using R.”

Therefore, we need to use another function from a different package to specify the exact type of errors we wish to use. We will be using the car package. Run the below command to install the car package if you haven’t already installed it. It is the same package as for Levene’s test above, so if you’ve been following the tutorial from the beginning, you might not have to install and load the package. If you’re unsure, just run the commands to be safe.

install.packages("car")

Then load the package.

library(car)

Then, run the car Anova command on our fit2 object, specifying that we wish to use Type III errors.

Anova(fit2, type="III")

This produces the following output:

As you can see in our row Species, column Pr(>F), which is the p-value, species still has a significant impact on the length of the petal, even when controlling for the length of the sepal. This could imply that the flowers really have different proportions and aren’t simply bigger or smaller because of the species.

Try running the summary command on the fit2 object to see that the summary command produces incorrect results; however, if you were to look at the fit2 object through the summary.lm command, which produces the output in the style of a linear model (i.e., OLS) and also uses type III errors, you would get the same correct information in the output as via the Anova command from the car package.

We could report this finding as shown below.

The covariate, sepal length, was significantly related to the flowers’ petal length, F(1,146)=194.95, p<.001. There was also a significant effect of the species of the plant on the petal length after controlling for the effect of the sepal length, F(2,146)=624.99, p<.001.

After completing either the ANOVA or ANCOVA, you should normally be running the appropriate post hoc tests to reveal more about the effects. After all, an ANOVA is merely an inferential test, i.e., it tests whether the data is distributed in a way that we would expect if the distribution were random. So far, we only know that there is a relationship between species and sepal length —we know that sepal length is non-randomly distributed when grouped by species. However, how exactly does species influence sepal length? One way of achieving this is by breaking down the variance explained by the independent variable of interest into its components . You can read more about this in my article on planned contrasts.

--

--