An Ode To R-Squared

or, The Statistician’s Crusade: A Duty-Dance with Determination

Callum Ballard
Towards Data Science

--

It is said that being able to explain something to another person is the first step to mastering it. And as an aspiring data scientist, R-Squared feels like the kind of thing I ought to master…

This blog therefore covers the following questions:

  • What is R-Squared, and what does it tell us?
  • How can we derive the formula for R-Squared intuitively?
  • Why is R-Squared always between 0 and 1?
Portrait of Student Trying to Understand R-Squared, With Cat (c.2019)

What is R-Squared, and what does it tell us?

Suppose we have an output metric, Y, with observed data points, yᵢ. Then the simplest way to predict future observations is to take the mean of our existing observations.

The baseline mean model, ȳ

Note how this baseline model always predicts the same value for the future observations. Note also that this prediction doesn’t depend on the value of x. Needless to say, it’s not a very good model.
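
To make the baseline concrete, here's a minimal NumPy sketch. The data values and variable names are my own illustrations, not from the article:

```python
import numpy as np

# Illustrative observations of Y (made-up values)
y = np.array([2.1, 2.9, 3.7, 4.2, 5.1])

# The baseline mean model: predict ȳ for every future observation,
# no matter what x is
y_bar = y.mean()
baseline_prediction = np.full_like(y, y_bar)

print(y_bar)                 # 3.6
print(baseline_prediction)   # [3.6 3.6 3.6 3.6 3.6]
```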

To improve on this situation, suppose we instead create a linear regression model, f, to predict values of Y, based on observed data points, yᵢ, and their associated x values.

Our model, ‘f’

Then we probably want some statistic (let’s call it R-Squared) that will tell us how good our model is. In particular, how much of an improvement is our model f over the ‘baseline mean model’?
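
As a sketch of what such a model might look like in practice, here is a least-squares line fitted with NumPy's polyfit. The x values are made up to go with the illustrative data above; none of this is from the article itself:

```python
import numpy as np

# Illustrative x values and observed y values (made-up data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 4.2, 5.1])

# Fit a simple linear regression f(x) = a*x + b by least squares
a, b = np.polyfit(x, y, deg=1)
f = a * x + b   # the model's predictions, fᵢ, at each observed x

print(f"f(x) = {a:.2f}x + {b:.2f}")
print(f)
```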

R-Squared tells us: ‘How much of the variance from the mean does our model account for?’

If the model accounts for 100% of the variance (i.e. R-Squared = 1), then we can say that it perfectly explains the observed data points.

If the model accounts for 0% of the variance (i.e. R-Squared = 0), then we can say that it explains precisely nothing about the observed data points.
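
A quick way to see these two extremes is with scikit-learn's r2_score (the data here is purely illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([2.1, 2.9, 3.7, 4.2, 5.1])

# A model that reproduces every observation exactly explains all of the variance
print(r2_score(y, y))                          # 1.0

# A model that always predicts the mean explains none of it
print(r2_score(y, np.full_like(y, y.mean())))  # 0.0
```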

In the real world, R-Squared is good at facilitating comparisons between models. However, identifying a ‘good’ value of R-Squared in and of itself is a bit slippery. Generally, an R-Squared above 0.6 makes a model worth your attention, though there are other things to consider:

  • Any field that attempts to predict human behaviour, such as psychology, typically has R-squared values lower than 0.5. Humans are inherently difficult to predict!
  • A model with a high R-squared value could suffer from other issues, such as over-fitting. R-squared is just one of many ways the data scientist can evaluate the validity of their model.

How can we derive the formula for R-Squared intuitively?

Rather than diving straight into a sea of algebra, I want to think about R-Squared pictorially.

We know already that a regression will create a model that minimises the residuals (i.e. the differences between the model’s predicted values of y and the actual observed values of y). So looking at the residuals of our models seems like a sensible place to start.

Our model, f, (L) and the baseline case ȳ (R)

The residuals for our model f (the blue arrows in the left-hand chart) can be thought of as the bits of the observed values that our model cannot explain.

Once we start summing these residuals as part of a formula, it becomes useful to sum their squares instead. This ensures that we don’t have to deal with any negative values, and it amplifies the effect of large individual errors, making it extra obvious when a model is not making good predictions.
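
A two-line example of why the squaring matters: raw residuals of opposite sign can cancel out, while squared residuals cannot (the numbers are made up):

```python
import numpy as np

# One prediction 2 units too high and one 2 units too low
residuals = np.array([2.0, -2.0])

print(residuals.sum())         # 0.0 — the raw errors cancel each other out
print((residuals ** 2).sum())  # 8.0 — squaring keeps every error visible
```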

The squared residuals for our model, f, (L) and the baseline case ȳ (R)

Again, we can visualise these squared residuals for both our model and the baseline case. Here, the blue squares are another way of visualising the part of the error our model cannot explain. The total area of the blue squares can be given mathematically as:

SSres = Σᵢ (yᵢ − fᵢ)², the residual sum of squares of our regression model, also known as SSE (the Sum of Squared Errors)

Further, the sum of the orange squares can be given as:

SStot = Σᵢ (yᵢ − ȳ)², the total sum of squares: the squared differences between the observed values yᵢ and their mean ȳ
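
In code, these two sums might look like the following (continuing the illustrative data and fitted line from earlier; all names are my own):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 4.2, 5.1])

a, b = np.polyfit(x, y, deg=1)  # simple least-squares line
f = a * x + b                   # model predictions fᵢ

ss_res = np.sum((y - f) ** 2)         # area of the blue squares (SSres)
ss_tot = np.sum((y - y.mean()) ** 2)  # area of the orange squares (SStot)

print(ss_res, ss_tot)
```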

We wanted R-Squared to be a comparison of our model with the baseline case. We can make this comparison by taking the area of the blue squares as a proportion of the area of the orange squares:

SSres / SStot

This expression thus tells us what share of the variance from the mean, ȳ, is not explained by the model. Therefore, the share of the variance that is explained by the model, or the fabled R-Squared, can be given as:

R² = 1 − (SSres / SStot)
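
Putting the pieces together, here is a quick sanity check that the 1 − SSres/SStot formula agrees with scikit-learn's built-in r2_score (again, the data and names are illustrative):

```python
import numpy as np
from sklearn.metrics import r2_score

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 4.2, 5.1])

a, b = np.polyfit(x, y, deg=1)
f = a * x + b

ss_res = np.sum((y - f) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

print(1 - ss_res / ss_tot)  # R-Squared from the formula above
print(r2_score(y, f))       # the same value from scikit-learn
```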

Why is R-Squared always between 0 and 1?

One of R-Squared’s most useful properties is that it is bounded between 0 and 1. This means that we can easily compare different models, and decide which one better explains the variance from the mean.

Of course, we know from the above that R-Squared can be expressed thusly:

R² = 1 − (SSres / SStot)

So for R-Squared to be bounded between 0 and 1, we require (SSres / SStot) itself to be between 0 and 1. This happens if:

  1. SSres ≤ SStot (for R-Squared to be greater than or equal to 0)
  2. SSres and SStot are either both positive or both negative (for R-Squared to be less than or equal to 1).

Let’s look at these in turn. Recall that:

SSres = Σᵢ (yᵢ − fᵢ)², the sum of all the squared residuals from the model.
SStot = Σᵢ (yᵢ − ȳ)², the sum of all the squared residuals from the mean of the observed values.

For (1), we can make an intuitive argument. Remember, SStot represents the gaps between the observed y values, and their mean, ȳ.

Given that y = ȳ represents a straight line (in particular, a horizontal line across the x-y plane), y = ȳ is itself a linear model for our dataset. It isn’t a good model, of course, but it’s a model nonetheless.

Let’s now think about the model for which we’d calculate R-Squared: the one that produces the fᵢ terms, and hence SSres. This model would have been created by regression, and by definition we know that the regression process produces the model that minimises the sum of squared residuals for the dataset.

We have two possible cases:

  • The regression produces y = ȳ as the model that minimises residuals. Thus, we have SSres = SStot, since fᵢ = ȳ across the dataset.
  • The regression produces a different model. Since the regression produces the model that minimises the sum of squared residuals, this model’s squared residuals must sum to less than they do in the y = ȳ case. Therefore, we must have SSres < SStot.

Thus, we have SSres ≤ SStot.

Hence, R-Squared will always be greater than or equal to 0.
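
An empirical illustration of this (not a proof): fit a least-squares line to lots of random datasets and check that SSres never exceeds SStot. The set-up below is my own, not the article's:

```python
import numpy as np

rng = np.random.default_rng(0)

# For many random datasets, the fitted least-squares line never does worse
# than the mean model, so SSres <= SStot every time.
for _ in range(1_000):
    x = rng.normal(size=20)
    y = rng.normal(size=20)
    a, b = np.polyfit(x, y, deg=1)
    ss_res = np.sum((y - (a * x + b)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    assert ss_res <= ss_tot + 1e-9  # tiny tolerance for floating-point error

print("SSres <= SStot held in every trial")
```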

For (2), we know that both SSres and SStot are given as sums of squares (see above). Since squared numbers can never be negative, both SSres and SStot will always be non-negative, and so their ratio can never be negative.

Hence, R-Squared will always be less than or equal to 1.

--
