Exploratory Correlational Analysis in R

Painless and tidyverse-friendly correlational analysis using rstatix

Brendan Mapes
Towards Data Science


Correlational analysis is one of the most basic and foundational ways to explore the relationship between two or more variables. You may have already performed correlational analysis at some point using R, and it might have looked something like this:

cor_results <- cor.test(my_data$x, my_data$y,
                        method = "pearson")

cor_results

With an output looking like:

[Image: example output of a correlation test based on test data]

This is the base R way to run a simple correlation for two variables that you’ve already picked out in advance.

But what if you don’t really know what you’re looking for? If you’re just at the stage of conducting some exploratory data analysis, you might not know which variables you’re interested in, or where to look for associations. What would be useful in this case is the ability to pick a variable of interest, and then check it against a dataset with dozens, maybe even hundreds, of other variables in order to figure out a good starting point for further analysis. Thanks to the rstatix package developed by kassambara, there is a quick and relatively painless way to do this.
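To see the one-against-many pattern in miniature before pulling real data, here’s a sketch using the built-in mtcars dataset: when cor_test() is given a single variable, it correlates that variable against every other column in the data frame (this assumes rstatix is installed).

```r
library(rstatix)

# Correlate mpg against every other column in mtcars;
# the result is a tibble with one row per variable pairing
toy_results <- cor_test(mtcars, mpg, method = "pearson")

toy_results
```

One row per pairing, in a tibble: this is the shape that makes the exploratory workflow below possible, because the output can be filtered and sorted like any other data frame.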

Getting the Data

For an example, I’ll use data from the World Bank’s World Development Indicators (WDI) dataset — a repository of open access data on global development indicators. We could download the data from the WDI website, but there’s also an R package for that:

install.packages("WDI")
library(WDI)

Specific data series can be imported from WDI using the WDI() function, but because we’re interested in exploratory analysis covering possible relationships between lots of variables, I’m going to bulk download the whole database:

bulk <- WDIbulk(timeout = 600)

Let’s say we’re interested in trying to figure out what other country characteristics might be associated with countries that trade more, relative to the size of their economy, and we’re also interested in data from the year 2020.

Once we’ve identified the right variable (here I’m going to use Trade as a % of GDP), we need to clean up the data a bit. We’ll create a list of series that are annual that we can filter on, and then apply another filtering step to make sure that we are only using variables that have a good number of observations to use in the analysis (arbitrarily, n>100 in the example below).

library(tidyverse)

# Create a filtered set with only annual variables
filtered <- bulk$Series %>% filter(Periodicity == "Annual")

# Create a list of variables to correlate against trade levels
bulk$Data %>%
  filter(Indicator.Code %in% c(filtered$Series.Code)) %>%
  filter(year == 2020) %>%
  group_by(Indicator.Code) %>%
  filter(!is.na(value)) %>%
  count() %>%
  arrange(n) %>%
  filter(n > 100) -> vars_list

Running the Analysis

So now we have a list of variables to run through (about 790 of them) to see which might be correlated with our trade level variable. This would take forever to work through manually, or with a loop around cor.test() from base R. This is where the cor_test() function in rstatix shines: it runs quickly, it returns the output of the correlational analysis as a tibble (making additional manipulation and analysis easy), and it is pipe-friendly, so filtering, mutation, and execution steps can be combined into a single pipeline. We can also group variable inputs to get grouped output from rstatix (we’ll look at an example of this a little later).

So, to run the analysis:

# Because WDI contains regional data as well, we'll create a list that
# only has country codes, and filter our input based on that list
library(rstatix)

countries <- bulk$Country %>%
  filter(!Region == "") %>%
  as_tibble()

bulk$Data %>%
  filter(Indicator.Code %in% c(vars_list$Indicator.Code)) %>%
  filter(year == 2020) %>%
  filter(Country.Code %in% c(countries$Country.Code)) %>%
  select(-Indicator.Name) %>%
  pivot_wider(names_from = Indicator.Code,
              values_from = value) %>%
  cor_test(NE.TRD.GNFS.ZS,
           method = "pearson",
           use = "complete.obs") -> results

results

This populates a tidy tibble with the variable pairings, the correlation coefficient (cor), the t-statistic (statistic), the p-value (p), and the low and high bounds of the confidence interval.

Because the output is a tibble, we can sort and decompose it however we want. Let’s make a key with the variable name and description, join that into our output data, filter for only variable pairings that were significant at the p < 0.05 level, and check which variables had the highest r value:

indicator_explanation <- bulk$Series %>%
  select(Series.Code, Indicator.Name, Short.definition) %>%
  as_tibble()

results %>%
  left_join(indicator_explanation, by = c("var2" = "Series.Code")) %>%
  arrange(desc(cor)) %>%
  filter(p < 0.05) %>%
  View()

Some of the variables with the highest correlations won’t be surprising — overall trade is positively correlated at a high level across countries with trade in services and in merchandise, for example. Others might be more unexpected, like the moderately high positive correlation (r = 0.43) between a country’s level of trade and the amount of Official Development Assistance (aid funding) it receives as a percent of gross capital formation (often used as an indicator of aid “dependency”).

Grouped Analysis

So what if we wanted to look into this relationship more? For example — is the relationship still strong if we look at other years besides 2020? This is where the pipe-friendly nature of cor_test() again comes in handy.

Let’s filter our initial data to include just the two indicators we’re interested in, and then group the data by year before piping it into cor_test() this time:

bulk$Data %>%
  filter(Indicator.Code %in% c("NE.TRD.GNFS.ZS", "DT.ODA.ODAT.GI.ZS")) %>%
  filter(Country.Code %in% c(countries$Country.Code)) %>%
  select(-Indicator.Name) %>%
  filter(year < 2021) %>%
  pivot_wider(names_from = Indicator.Code,
              values_from = value) %>%
  group_by(year) %>%
  cor_test(NE.TRD.GNFS.ZS, DT.ODA.ODAT.GI.ZS,
           method = "pearson",
           use = "complete.obs") -> results_time

This will give us output on the correlation between the two variables, in each year with observations (I filtered the data to only years prior to 2021, because the ODA data only runs up to 2020). And because the correlation data is stored in a tidy way, we can easily run additional code to visualize our results:

results_time %>%
  mutate(`Significant?` = if_else(p < 0.05, "Yes", "No")) %>%
  ggplot(aes(x = year, y = cor)) +
  geom_hline(yintercept = 0,
             linetype = "dashed") +
  geom_line() +
  geom_point(aes(color = `Significant?`)) +
  ylab("cor (r)") +
  theme_minimal()

Here we can see that historically there hasn’t been much of a relationship between these two variables at all (apart from a few years here and there with a weak negative relationship), but in the past few years the correlation has trended positive and significant.

So what does this mean? In terms of any potential questions around the relationship between trade and aid, we’d have to do a lot more research. Correlation doesn’t imply causation, after all, but this is a good hypothesis generator: are aid-receiving countries becoming increasingly trade-oriented? Or are the patterns of aid delivery shifting to favor countries that trade more? These are new avenues for us to explore. These sorts of quick correlational analyses can be a really useful tool for things like trends analysis or signal spotting, and having a tidyverse-friendly way to do it avoids a potential headache.

In terms of our ability to conduct some useful exploratory analysis quickly and easily, we can see that rstatix is a helpful companion package. However, there are a few drawbacks to cor_test() in rstatix:

  1. You are limited to Pearson (r), Spearman (ρ), and Kendall (τ) correlation methods, compared to quite a few more in the “correlation” package, for example. These are, however, the most common methods for casual users and should be more than sufficient for basic analysis.
  2. Confidence intervals are only reported in the output for Pearson’s r, meaning that if confidence intervals are needed for Spearman’s rho or Kendall’s tau, additional code will be required.
  3. Sample size and degrees of freedom are not reported, which could be annoying if the user’s aim is to develop multiple reports on outputs based on different grouped segments, for example.
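As an illustration of the second drawback, here is one way to build an approximate confidence interval for Spearman’s rho yourself. This is a base R sketch, not anything cor_test() provides, and it uses the Fisher z-transformation with the commonly cited adjusted standard error sqrt(1.06 / (n − 3)) for Spearman’s rho — an approximation, not an exact interval:

```r
# Approximate confidence interval for Spearman's rho via the
# Fisher z-transformation with the adjusted SE sqrt(1.06 / (n - 3))
spearman_ci <- function(x, y, conf.level = 0.95) {
  ok   <- complete.cases(x, y)        # drop pairs with missing values
  rho  <- cor(x[ok], y[ok], method = "spearman")
  n    <- sum(ok)
  z    <- atanh(rho)                  # Fisher transform of rho
  se   <- sqrt(1.06 / (n - 3))
  crit <- qnorm(1 - (1 - conf.level) / 2)
  c(rho       = rho,
    conf.low  = tanh(z - crit * se),  # back-transform the z-scale bounds
    conf.high = tanh(z + crit * se))
}

spearman_ci(mtcars$mpg, mtcars$wt)
```

A bootstrap interval would be another reasonable route; the point is simply that, unlike for Pearson’s r, this step is left to you.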

But these won’t typically apply for the casual user. Additionally, beyond cor_test(), rstatix also has a large number of other functions for various other statistical tests and procedures, which are definitely worth looking into the next time you need to do some exploratory statistical analysis — so shout out to the developers on this one.

Note: For a more in-depth look at the differences between rstatix and other available packages for correlational analysis in R, interested readers can see: https://www.r-bloggers.com/2021/01/correlation-analysis-in-r-part-2-performing-and-reporting-correlation-analysis/

Enjoyed this story? Follow me on Medium, or connect with me on LinkedIn or Twitter.


Strategic Foresight for Sustainable Development. Writing about the future, food and complex systems. Academic work at https://bit.ly/34cP6gV