When I first started out doing machine learning, I learnt that:
- R² is the coefficient of determination, a measure of how well the data is explained by the fitted model,
- R² is the square of the coefficient of correlation, R,
- R is a quantity that ranges from -1 to 1.
Therefore, R² should range from 0 to 1.
Colour me surprised when the r2_score
implementation in sklearn returned negative scores. What gives?

This article is adapted from my original blogpost here
The answer lies in the definition
R² is defined upon the basis that the total sum of squares of a fitted model equals the explained sum of squares plus the residual sum of squares, or:

SS_tot = SS_exp + SS_res (Equation 1)

where:
- Total sum of squares (SS_tot) represents the total variation in the data, measured by the sum of squared differences between the actual values and their mean,
- Explained sum of squares (SS_exp) represents the variation in the data explained by the fitted model, and
- Residual sum of squares (SS_res) represents the variation in the data that is not explained by the fitted model.
R² itself is defined as follows:

R² = 1 - SS_res / SS_tot

Given these definitions, a negative R² is only possible when the residual sum of squares (SS_res) exceeds the total sum of squares (SS_tot). If Equation 1 holds, this cannot happen: SS_exp is non-negative, so SS_res can be at most SS_tot. A negative R² can therefore only mean that the explained and residual sums of squares no longer add up to the total sum of squares. In other words, the equality in Equation 1 does not appear [1] to be true.
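We can check this definition directly in code. Below is a minimal sketch (the toy values are my own) that computes R² by hand as 1 - SS_res/SS_tot and compares it against sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.3])

ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation in the data
ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))      # the two values agree
```

Whenever SS_res grows larger than SS_tot, 1 - SS_res/SS_tot drops below zero, which is exactly how r2_score produces negative numbers.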
How can this be?
Because we evaluate models separately on train and test data
Following the above definitions, SS_tot can be calculated from the data alone, while SS_res depends on both the model's predictions and the data. While we can use any arbitrary model to generate predictions for scoring, the equality in Equation 1 is only guaranteed for models evaluated on the same data they were trained on. It therefore doesn't necessarily hold when we use test data to evaluate models built on train data! There is no guarantee that the differences between a foreign model's predictions and the data are smaller than the variation within the data itself.
We can demonstrate this empirically. The code below fits a series of linear regression models on randomly generated data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

for _ in range(20):
    data = np.random.normal(size=(200, 10))
    X_train = data[:160, :-1]
    X_test = data[160:, :-1]
    y_train = data[:160, -1]
    y_test = data[160:, -1]

    lr = LinearRegression()
    lr.fit(X_train, y_train)

    y_train_pred = lr.predict(X_train)
    y_test_pred = lr.predict(X_test)

    train_score = r2_score(y_train, y_train_pred)
    test_score = r2_score(y_test, y_test_pred)
    print(f"Train R2: {train_score:.3f}, Test R2: {test_score:.3f}")
```
Try as we might, R² never drops below zero when the models are evaluated on their own train data. Here’s what I got in STDOUT:
```
Train R2: 0.079, Test R2: -0.059
Train R2: 0.019, Test R2: -0.046
Train R2: 0.084, Test R2: -0.060
Train R2: 0.020, Test R2: -0.083
Train R2: 0.065, Test R2: -0.145
Train R2: 0.022, Test R2: 0.032
Train R2: 0.048, Test R2: 0.107
Train R2: 0.076, Test R2: -0.031
Train R2: 0.029, Test R2: 0.006
Train R2: 0.069, Test R2: -0.150
Train R2: 0.064, Test R2: -0.150
Train R2: 0.053, Test R2: 0.096
Train R2: 0.062, Test R2: 0.022
Train R2: 0.063, Test R2: 0.008
Train R2: 0.059, Test R2: -0.061
Train R2: 0.076, Test R2: -0.191
Train R2: 0.049, Test R2: 0.099
Train R2: 0.040, Test R2: -0.012
Train R2: 0.096, Test R2: -0.373
Train R2: 0.073, Test R2: 0.088
```
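As an aside, there is no lower bound at all: the worse the predictions, the more negative the score. A contrived example of my own, predicting a constant nowhere near the data:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_bad = np.full_like(y_true, 100.0)  # predictions wildly off from the data

# SS_res dwarfs SS_tot here, so 1 - SS_res/SS_tot is hugely negative
print(r2_score(y_true, y_bad))
```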
So … what about R² being the square of correlation?
It appears that R² = R * R only under limited circumstances. Quoting the paragraph below from the relevant Wikipedia page:
> There are several definitions of R² that are only sometimes equivalent. One class of such cases includes that of simple linear regression where r² is used instead of R². When only an intercept is included, then r² is simply the square of the sample correlation coefficient (i.e., r) between the observed outcomes and the observed predictor values. If additional regressors are included, R² is the square of the coefficient of multiple correlation. In both such cases, the coefficient of determination normally ranges from 0 to 1.
In short, R² is only the square of the correlation coefficient if we happen to be (1) using linear regression models, and (2) evaluating them on the same data they are fitted on (as established previously).
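A quick empirical check of this claim, on synthetic data of my own: for a simple linear regression scored on its own training data, r2_score matches the squared sample correlation between the predictor and the outcome.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)  # noisy linear relationship

X = x.reshape(-1, 1)
lr = LinearRegression().fit(X, y)
r2 = r2_score(y, lr.predict(X))   # R² on the training data itself

r = np.corrcoef(x, y)[0, 1]       # sample correlation coefficient
print(r2, r ** 2)                 # identical up to floating point
```

Swap in test data or a non-linear model and this identity breaks down, as shown above.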
On the liberal use of R² outside the context of linear regression
The quoted Wikipedia paragraph lines up with my observation flipping through statistical texts: R² is almost always introduced within the context of linear regression. That being said, the formulation of R² makes it universally defined for any arbitrary predictive model, regardless of statistical basis. It is used liberally by data scientists in regression tasks, and is even the default metric for regression models in sklearn. Is it right for us to use R² so freely outside its original context?
Honestly, I don’t know. On one hand, it clearly has a lot of utility as a metric, which led to its widespread adoption by data scientists in the first place. On the other hand, you can find discussions like these online that caution against using R² for non-linear regression. From a statistics perspective, it does seem important for R² to be calculated under the right conditions so that its properties can be used for further analysis. I take the relative lack of such discourse about R² within data science circles to mean that, from a data science perspective, R² is no more than a performance metric like MSE or MAE.
Personally, I think we are good with using R², as long as we understand it enough to know what not to do with it.
Wrapping up
To summarize, we should expect R² to be bounded between zero and one only when a linear regression model is fitted and evaluated on the same data it was fitted on. Otherwise, the definition of R² allows negative values.
[1] Being specific with my choice of words here 🙂