
Foundations of correlational analyses

How to appropriately measure bivariate relationships in your data

Photo by Isaac Smith on Unsplash

Valentine’s Day may be a distant memory, yet it still feels like the perfect opportunity to discuss relationships with you – that is, relationships within your data.

By the end of this post, you will have a better understanding of the most common correlation coefficients and when to use them. Freed from reliance on default settings, you will have more confidence when interpreting and presenting the results of your correlational studies.

We will look at:

  • Pearson’s r
  • Spearman’s rho
  • Kendall’s tau
  • Point biserial correlation

If you wish to follow along, you can find the dataset I used here.


How to measure relationships

We can determine whether relationships exist between variables by measuring how they covary. Before we get into covariance, remember that variance measures the average amount that our data deviate from the mean (eq 1).

s² = ∑(xᵢ − x̄)² / (N − 1)

Eq 1 – Equation for sample variance (s²), equal to the sum (∑) of the squared differences between each score (xᵢ) and the mean (x̄), divided by the number of observations (N) minus 1.

We square the deviations in the above equation to prevent positive and negative deviations from cancelling each other out. Just as a quick example, suppose we have the following set of values (2, 3, 5, 7, 8). The sum of these values is 25, which results in an average of 5. Using equation 1, we get:

[(2 − 5)² + (3 − 5)² + (5 − 5)² + (7 − 5)² + (8 − 5)²] / (5 − 1) = 6.5

If we didn’t square the deviations, they would cancel out [(−3) + (−2) + 0 + 2 + 3 = 0] and the resulting variance would be zero, which clearly isn’t the case.
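
As a quick sanity check, here is a minimal sketch of this calculation in Python (only NumPy is assumed):

```python
import numpy as np

x = np.array([2, 3, 5, 7, 8])

# Sample variance: sum of squared deviations from the mean,
# divided by N - 1. ddof=1 asks NumPy for the N - 1 denominator.
manual = ((x - x.mean()) ** 2).sum() / (len(x) - 1)
print(manual, np.var(x, ddof=1))  # both print 6.5
```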

Covariance, then, measures how two variables vary with respect to one another. To calculate it, we simply find the average cross-product of the deviations (eq 2). If a relationship exists, as one variable deviates from its mean, so too will the second variable. If both variables deviate in the same direction, the resulting covariance will be positive (the product of two values with the same sign). Conversely, if the variables deviate in opposite directions, the resulting covariance will be negative (the product of a positive and a negative value).

cov(x, y) = ∑(xᵢ − x̄)(yᵢ − ȳ) / (N − 1)

Eq 2 – Equation for covariance, equal to the sum of the cross products of the deviations of x and y from their respective means, divided by N − 1.
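
A sketch of this formula in code, again assuming NumPy and a hypothetical second array for illustration:

```python
import numpy as np

x = np.array([2, 3, 5, 7, 8])
y = np.array([1, 2, 6, 8, 9])  # hypothetical second variable

# Average cross-product of the deviations, again with N - 1.
manual = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# np.cov returns a 2x2 covariance matrix; the off-diagonal
# entry [0, 1] is cov(x, y) (N - 1 denominator by default).
print(manual, np.cov(x, y)[0, 1])  # the two values agree
```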

The problem with covariance is that its value is not standardized, which limits its utility. For example, in the dataset we will be using, there is a variable for BMI (body mass index). BMI can be expressed in kg/m² or lb/in², with a conversion factor of 703 between the two. Depending on which units are used, the covariance will be very different. How can we compare such different values when they represent the same thing (BMI)?

By taking the square root of the variance, we get the standard deviation, which lets us express spread in the original units of the variable and compare deviations from the mean in a useful way. When we extend this standardizing idea to covariance, the result is a standardized covariance, known as the correlation coefficient (eq 3).

r = cov(x, y) / (sₓ sᵧ)

Eq 3 – Equation for standardized covariance (r), where the covariance is divided by the product of the standard deviations of x and y. Also known as Pearson’s correlation coefficient.
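
Continuing the sketch from above, the standardization step looks like this. It also makes the units point concrete: rescaling a variable (e.g. by the 703 BMI conversion factor) changes the covariance but leaves r untouched.

```python
import numpy as np

x = np.array([2, 3, 5, 7, 8])
y = np.array([1, 2, 6, 8, 9])  # same hypothetical arrays as before

# Divide the covariance by the product of the standard deviations
# (ddof=1 to match the N - 1 denominator used for the covariance).
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r, np.corrcoef(x, y)[0, 1])  # identical values

# Rescaling x multiplies the covariance by 703 but leaves r unchanged.
print(np.cov(x * 703, y)[0, 1] / np.cov(x, y)[0, 1])  # 703.0
print(np.corrcoef(x * 703, y)[0, 1])                  # same r as above
```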

This value is very useful because the result is bounded between -1 and +1. Here, an r value of -1 indicates a perfect negative relationship, an r value of +1 indicates a perfect positive relationship, and an r value of 0 indicates no linear relationship.

The r value is also used as a measure of effect size, where a value of +/- 0.1 suggests a small effect, +/- 0.3 a moderate effect, and +/- 0.5 a large effect.

In the next section, we will explore different types of correlations and how to choose the appropriate test for your research questions.

Bivariate correlations

The most common bivariate correlations include Pearson’s correlation, as described above, Spearman’s rho, Kendall’s tau, and the point biserial correlation.

Pearson’s r

Let us start by exploring our dataset using Pearson’s correlation. Pearson’s r is the default correlation method in Python’s pandas library. If we plot a heatmap (fig. 1) showing the relationships between our variables, we get the following:

Figure 1 – Heatmap annotated with the r values from bivariate correlations using Pearson’s r. Image by author
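
For reference, here is a minimal sketch that reproduces this kind of heatmap, assuming the dataset has been saved locally as insurance.csv (the file name is an assumption):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name; point this at wherever you saved the dataset.
df = pd.read_csv("insurance.csv")

# DataFrame.corr() defaults to method="pearson";
# numeric_only=True skips the non-numeric columns.
corr = df.corr(numeric_only=True)

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```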

Suppose we want to determine whether there is a relationship between age and the amount one pays for health insurance (charges). Directing our attention to the bottom left corner of the figure above, we see a positive correlation with a moderate effect size between our variables, r = 0.30. As we will soon see, however, we must be cautious when relying on default settings.

If we were only interested in determining whether a linear relationship exists, we would simply need to ensure that our variables were measured at least at an interval level (meaning they are ordered, and there is an equal distance between levels). Here, age and charges both satisfy that assumption. If we wish to determine if that relationship is statistically significant, however, further assumptions would have to be considered. Specifically, for the test to be valid, our variables should be normally distributed.

First, let’s plot the distribution of each variable to inspect for normality (fig. 2). It is clear, even without additional tests, that neither variable is normally distributed. For age, this doesn’t appear to be caused by extreme scores (outliers); other than a peak around 20, the remaining values are fairly uniform. For charges, on the other hand, the distribution is heavily right-skewed: most values are concentrated at the lower end, with a long tail stretching toward high values. This is good news for those paying, since it means that the majority are paying less for their insurance policies!

Figure 2 – Distribution plots for the variables charges (left) and age (right). Image by author
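
A sketch of how these distributions might be inspected, along with a formal normality check (D’Agostino’s K² test via scipy.stats.normaltest is one reasonable choice; df is the DataFrame loaded in the earlier sketch):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["charges"], kde=True, ax=axes[0])
sns.histplot(df["age"], kde=True, ax=axes[1])
plt.show()

# Null hypothesis: the sample comes from a normal distribution,
# so a small p-value (< 0.05) is evidence against normality.
for col in ["charges", "age"]:
    stat, p = stats.normaltest(df[col])
    print(f"{col}: p = {p:.4f}")
```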

A good tool for investigating outliers is the boxplot (fig 3). These plots visualize data using quartiles. The line inside the box represents the median (Q2, the 50th percentile), and the box is bounded by the 1st (Q1, 25th percentile) and 3rd (Q3, 75th percentile) quartiles. Subtracting Q1 from Q3 gives the interquartile range (IQR). Adding 1.5× IQR to Q3 and subtracting 1.5× IQR from Q1 gives the maximum and minimum values (excluding outliers), demarcated by the lines extending outward from the box (called whiskers). Any values beyond the whiskers are considered outliers. As evident in the boxplot for charges, many outliers are present.

Figure 3 – Boxplots for the variables age (left) and charges (right). Image by author
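
These boxplots can be produced along the same lines, and the 1.5× IQR rule can be applied directly to count the outliers (again continuing with the df from the earlier sketch):

```python
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(y=df["age"], ax=axes[0])
sns.boxplot(y=df["charges"], ax=axes[1])
plt.show()

# The whiskers follow the 1.5 x IQR rule described above.
q1, q3 = df["charges"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["charges"] < q1 - 1.5 * iqr) | (df["charges"] > q3 + 1.5 * iqr)
print(mask.sum(), "outliers in charges")
```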

This is a good example of a case where Pearson’s correlation would not be an appropriate test to measure the relationship between two variables. Instead, we would have to turn to one of the other bivariate options.

Spearman’s rho vs Kendall’s tau

Spearman’s and Kendall’s correlation coefficients are calculated from ranked data. They can be used when the data are ordinal (meaning they are ordered, without an equal distance between levels) or when the data fail to meet the assumptions for Pearson’s correlation (as in our example). Below is a sketch of how the correlation differs based on the coefficient used.
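
pandas exposes all three coefficients through the method argument, so the comparison is a one-liner per coefficient (continuing with the same df):

```python
# Compare the three correlation coefficients for age vs charges.
for method in ["pearson", "spearman", "kendall"]:
    r = df["age"].corr(df["charges"], method=method)
    print(f"{method}: r = {r:.3f}")
```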

Notably, we find that the relationship between age and charges is stronger when using the rank-order correlations. Conversely, the relationships of BMI and smoker with charges are weaker than under Pearson’s r.

Point biserial correlation

In our dataset, independent of the correlation test used, age and smoker have the strongest relationships with charges. I want to bring your attention to the variable smoker because it represents a special case. Smoker is a binary variable (non-smoker vs smoker), and a relationship between a binary variable and a continuous variable should be investigated using the point biserial correlation. Using the code below, we arrive at a correlation of 0.79, which is the same as the result using Pearson’s r. In fact, the point biserial correlation is mathematically equivalent to Pearson’s r when one of the variables is binary.
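
Here is a sketch of that calculation using scipy.stats.pointbiserialr (the "yes"/"no" encoding of the raw smoker column is an assumption; adjust the mapping to your data):

```python
from scipy import stats

# Encode the binary variable numerically: 1 = smoker, 0 = non-smoker.
smoker = (df["smoker"] == "yes").astype(int)

r_pb, p = stats.pointbiserialr(smoker, df["charges"])
print(f"r_pb = {r_pb:.2f}, p = {p:.4f}")

# Pearson's r on the same encoded data gives the identical value.
print(smoker.corr(df["charges"]))
```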

Summary

In this article, we learned how to measure the relationship between two variables. We saw that using default settings to perform correlational analysis can be problematic and lead to inappropriate conclusions. We discussed the most common correlation coefficients and when to use them. If you want to check out my notebook, please visit my GitHub page.

Helpful resources

Correlation – Statistics Solutions

Useful guide to correlation coefficients

