Ways to Evaluate Regression Models

The very naive way of evaluating a model is to look at its R-Squared value. Suppose I get an R-Squared of 95%: is that good enough? In this blog, let us try to understand the different ways to evaluate a regression model.

Shravankumar Hiregoudar
Towards Data Science



Evaluation metrics:

  1. Mean/Median of prediction
  2. Standard Deviation of prediction
  3. Range of prediction
  4. Coefficient of Determination (R2)
  5. Relative Standard Deviation/Coefficient of Variation (RSD)
  6. Relative Squared Error (RSE)
  7. Mean Absolute Error (MAE)
  8. Relative Absolute Error (RAE)
  9. Mean Squared Error (MSE)
  10. Root Mean Squared Error on Prediction (RMSE/RMSEP)
  11. Normalized Root Mean Squared Error (Norm RMSEP)
  12. Relative Root Mean Squared Error (RRMSEP)

Let us consider the example of predicting the Active Pharmaceutical Ingredient (API) concentration in a tablet. Using absorbance units from NIR spectroscopy, we predict the API level in the tablet. The API concentration in a tablet can be 0.0, 0.1, 0.3, 0.5, 1.0, 1.5, 2.0, 2.5, or 3.0. We apply PLS (Partial Least Squares) regression and SVR (Support Vector Regression) to predict the API level.

NOTE: These metrics can be used to compare multiple models against each other, or different variants of the same model.

Mean/Median of prediction

We can understand the bias in prediction between two models using the arithmetic mean of the predicted values.

For example, the mean of the predicted values for 0.5 API is calculated by taking the sum of the predicted values for samples with 0.5 API and dividing it by the number of samples having 0.5 API.

np.mean(predictedArray)

In Fig. 1, we can see how PLS and SVR perform with respect to the mean. SVR predicted 0.0 API much better than PLS, whereas PLS predicted 3.0 API better than SVR. We can choose between the models depending on which API level is of interest.

Disadvantage: The mean is affected by outliers. Use the median when your predicted values contain outliers.

Fig.1. Comparing the mean of predicted values between the two models
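As a rough sketch of how a comparison like Fig. 1 could be produced, the snippet below groups predictions by their true API level and computes the mean and median per model. The file name and column names are illustrative assumptions, not part of the original analysis.

import pandas as pd
# hypothetical layout: one row per tablet, with the true API level and each model's prediction
df = pd.read_csv("predictions.csv")   # columns: api_level, pls_pred, svr_pred
summary = df.groupby("api_level")[["pls_pred", "svr_pred"]].agg(["mean", "median"])
print(summary)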

Standard Deviation of prediction

The standard deviation (SD) is a measure of the amount of variation or dispersion in a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, whereas a high standard deviation indicates that the values are spread out over a broader range. The SD of the predicted values helps in understanding the dispersion of values across different models.

Standard Deviation Formula: SD = √( (1/N) Σ (xi − x̄)² ), where x̄ is the mean of the values
np.std(predictedArray)

In Fig. 2, the dispersion of the predicted values is smaller for SVR than for PLS. So, SVR performs better when we consider the SD metric.

Fig.2. Comparing the standard deviation of predicted values between the two models

Range of prediction

The range of prediction is the difference between the maximum and minimum of the predicted values. Like the standard deviation, the range helps us understand the dispersion of the predictions across models.
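A minimal sketch, assuming the predictions of one model are in a NumPy array:

import numpy as np
prediction_range = np.ptp(predictedArray)              # max - min of the predicted values
lowest, highest = predictedArray.min(), predictedArray.max()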

Coefficient of Determination (R2)

R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model. Whereas correlation explains the strength of the relationship between an independent and dependent variable, R-squared explains to what extent the variance of one variable explains the variance of the second variable. So, if the R2 of a model is 0.50, then approximately half of the observed variation can be explained by the model’s inputs.

R Squared Formula: R² = 1 − (sum of squared residuals / total sum of squares)
R (Correlation) (source: http://www.mathsisfun.com/data/correlation.html)
from sklearn.metrics import r2_score
r2_score(actual, predicted)

Disadvantage: R2 does not account for overfitting; adding more predictors never decreases R2, even when they add no real explanatory power.
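For intuition, the same value can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares (a sketch, assuming actual and predicted are NumPy arrays):

import numpy as np
ss_res = np.sum((actual - predicted) ** 2)          # residual sum of squares
ss_tot = np.sum((actual - np.mean(actual)) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                            # matches sklearn's r2_score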

Relative Standard Deviation (RSD) / Coefficient of Variation (CV)

There is a saying that apples shouldn’t be compared with oranges, or in other words, don’t compare two items or groups of items that are practically incomparable. But the lack of comparability can be overcome if the two items or groups are somehow standardized or brought onto the same scale. For instance, when comparing the variability of two groups that are overall very different, such as the variation in the size of bluefin tuna and blue whales, the coefficient of variation (CV) is the method of choice: the CV simply represents the standard deviation of each group standardized by its group mean.

The coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of the dispersion of a probability distribution or frequency distribution. It helps us understand how spread out the data is across two different measurements or tests.

Standard deviation is the most common measure of variability for a single data set. But why do we need yet another measure, such as the coefficient of variation? Comparing the standard deviations of two data sets measured on different scales is not meaningful, but comparing their coefficients of variation is.

Coefficient of Variation (CV) Formula: CV = (Standard Deviation / Mean) × 100%
from scipy.stats import variation
variation(data)

For example, consider two different datasets:

Data 1: Mean1 = 120000 : SD1 = 2000

Data 2: Mean2 = 900000 : SD2 = 10000

Let us calculate the CV for both datasets:

CV1 = SD1 / Mean1 ≈ 1.67%

CV2 = SD2 / Mean2 ≈ 1.11%

Since CV1 > CV2, we can conclude that Data 1 is more spread out, relative to its mean, than Data 2.
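The same arithmetic in code, for anyone who wants to reproduce it (scipy.stats.variation gives the same ratio directly from raw data):

mean1, sd1 = 120000, 2000
mean2, sd2 = 900000, 10000
cv1 = sd1 / mean1 * 100   # ≈ 1.67
cv2 = sd2 / mean2 * 100   # ≈ 1.11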

Relative Squared Error (RSE)

The relative squared error (RSE) compares the model's error to what it would have been if a simple predictor had been used. More specifically, this simple predictor is just the average of the actual values. Thus, the relative squared error takes the total squared error and normalizes it by dividing by the total squared error of the simple predictor. Because it is a ratio, it can be compared between models whose errors are measured in different units.

Mathematically, the relative squared error, Ei of an individual model i is evaluated by the equation:

Relative Squared Error (RSE) Formula:

Ei = Σj (P(ij) − Tj)² / Σj (Tj − Tbar)²

where P(ij) is the value predicted by the individual model i for record j (out of n records), Tj is the target value for record j, and Tbar is the mean of the target values, Tbar = (1/n) Σj Tj.

For a perfect fit, the numerator is equal to 0 and Ei = 0. So, the Ei index ranges from 0 to infinity, with 0 corresponding to the ideal.
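A minimal NumPy sketch of this, assuming actual and predicted are arrays of the target and predicted values for one model:

import numpy as np
rse = np.sum((predicted - actual) ** 2) / np.sum((actual - np.mean(actual)) ** 2)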

Mean Absolute Error (MAE)

In statistics, mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement. It has the same unit as the original data, and it can only be compared between models whose errors are measured in the same units. It is usually similar in magnitude to RMSE, but slightly smaller. MAE is calculated as:

Mean Absolute Error (MAE) Formula: MAE = (1/n) Σ |yi − xi|
from sklearn.metrics import mean_absolute_error
mean_absolute_error(actual, predicted)

It is thus an arithmetic average of the absolute errors, where yi is the prediction and xi the actual value. Note that alternative formulations may include relative frequencies as weight factors. The mean absolute error uses the same scale as the data being measured. This is known as a scale-dependent accuracy measure and, therefore cannot be used to make comparisons between series using different scales.

Note: As you can see, all these statistics compare true values to their estimates, but each does it in a slightly different way. They all tell you "how far away" your estimated values are from the true values. Sometimes squares are used and sometimes absolute values; this is because with squares, extreme values have more influence on the result (see, for example, the Cross Validated question "Why square the difference instead of taking the absolute value in standard deviation?" or the related discussion on MathOverflow).

In MAE and RMSE, you simply look at the "average difference" between those two values, so you interpret them relative to the scale of your variable (i.e., an MAE or RMSE of 1 means the predictions are, on average, off by 1 unit of the actual variable).

In RAE and relative RSE, you divide those differences by the variation of the actual values, so the result is unitless: 0 means a perfect fit, and a value below 1 means the model does better than the naive mean predictor.

The values of ∑(MeanOfActual − actual)² or ∑|MeanOfActual − actual| tell you how much the actual values differ from their own mean (compare this to the variance). That is why these measures are called "relative": they express the error relative to the total variation of the actual values.
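To see the difference concretely, here is a small illustrative sketch (the numbers are made up): rescaling the data, for example from grams to milligrams, changes MAE by the same factor, while a relative measure such as RSE stays the same.

import numpy as np
actual = np.array([0.0, 0.5, 1.0, 2.0, 3.0])
predicted = np.array([0.1, 0.4, 1.2, 1.9, 2.8])
mae = np.mean(np.abs(predicted - actual))                                        # 0.14, in data units
rse = np.sum((predicted - actual) ** 2) / np.sum((actual - actual.mean()) ** 2)  # ~0.019, unitless
# rescale everything by 1000 (e.g. g -> mg): MAE scales with the data, RSE does not
actual_mg, predicted_mg = actual * 1000, predicted * 1000
mae_mg = np.mean(np.abs(predicted_mg - actual_mg))                               # 140.0
rse_mg = np.sum((predicted_mg - actual_mg) ** 2) / np.sum((actual_mg - actual_mg.mean()) ** 2)  # still ~0.019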

Relative Absolute Error (RAE)

Relative Absolute Error (RAE) is a way to measure the performance of a predictive model. RAE is not to be confused with relative error, which is a general measure of precision or accuracy for instruments like clocks, rulers, or scales. It is expressed as a ratio, comparing the mean error (residual) to the errors produced by a trivial or naive model. A good forecasting model will produce a ratio close to zero; a poor model (one that is worse than the naive model) will produce a ratio greater than one.

It is very similar to the relative squared error in the sense that it is also relative to a simple predictor, which is just the average of the actual values. In this case, though, the error is just the total absolute error instead of the total squared error. Thus, the relative absolute error takes the total absolute error and normalizes it by dividing by the total absolute error of the simple predictor.

Mathematically, the relative absolute error, Ei of an individual model i is evaluated by the equation:

Relative Absolute Error (RAE) Formula:

Ei = Σj |P(ij) − Tj| / Σj |Tj − Tbar|

where P(ij) is the value predicted by the individual model i for record j (out of n records), Tj is the target value for record j, and Tbar is the mean of the target values, Tbar = (1/n) Σj Tj.

For a perfect fit, the numerator is equal to 0 and Ei = 0. So, the Ei index ranges from 0 to infinity, with 0 corresponding to the ideal.
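As with the RSE, a minimal NumPy sketch, assuming actual and predicted are arrays of the target and predicted values:

import numpy as np
rae = np.sum(np.abs(predicted - actual)) / np.sum(np.abs(actual - np.mean(actual)))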

Mean Squared Error (MSE)

Mean Squared Error (MSE) or Mean Squared Deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors — that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate.

The MSE assesses the quality of a predictor (i.e., a function mapping arbitrary inputs to a sample of values of some random variable), or an estimator (i.e., a mathematical function mapping a sample of data to an estimate of a parameter of the population from which the data is sampled). The definition of an MSE differs according to whether one is describing a predictor or an estimator.

The MSE is a measure of the quality of an estimator — it is always non-negative, and values closer to zero are better.

Mean Squared Error (MSE) Formula: MSE = (1/n) Σ (yi − yi′)²
from sklearn.metrics import mean_squared_error
mean_squared_error(actual, predicted)

Let’s analyze what this equation actually means.

  • In mathematics, the character that looks like a weird E is the summation sign (the Greek letter sigma). It denotes the sum of a sequence of numbers, from i=1 to n. Imagine this as an array of points, where we go through all the points, from the first (i=1) to the last (i=n).
  • For each point, we take the y-coordinate of the point, and the y’-coordinate. We subtract the y-coordinate value from the y’-coordinate value and calculate the square of the result.
  • The third part is to take the sum of all the (y-y’)² values and divide it by n, which will give the mean.

Our goal is to minimize this mean, which gives us the line that best fits the points.
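The three steps above translate directly into a couple of NumPy lines (a sketch, assuming y holds the actual values and y_pred the predictions):

import numpy as np
squared_diffs = (y - y_pred) ** 2        # (y - y')² for every point
mse = np.sum(squared_diffs) / len(y)     # sum over all n points, divided by n
# equivalently: np.mean(squared_diffs)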

Root Mean Squared Error on Prediction (RMSE / RMSEP)

In statistical modeling and particularly regression analyses, a common way of measuring the quality of the fit of the model is the RMSE (also called Root Mean Square Deviation), given by

RMSE Formula: RMSE = √( (1/n) Σ (yi − ŷi)² )
from math import sqrt
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(actual, predicted)
rmse = sqrt(mse)

where yi is the ith observation of y and ŷi the predicted value from the model. If the predicted responses are very close to the true responses, the RMSE will be small. If the predicted and true responses differ substantially, at least for some observations, the RMSE will be large. A value of zero would indicate a perfect fit to the data. Since the RMSE is measured on the same scale and in the same units as y, one can expect roughly 68% of the observations to fall within one RMSE of their predicted values, provided the errors are normally distributed.

NOTE: RMSE is concerned with deviations from the true values, whereas the standard deviation (S) is concerned with deviations from the mean.

So calculating the RMSE helps to compare different models that are based on the same y observations. But what if

  1. one wants to compare model fits of different response variables?
  2. the response variable y is modified in some models, e.g. standardized or sqrt- or log-transformed?
  3. does splitting the data into a training and test dataset (after the modification) and calculating the RMSE on the test data have an effect on points 1 and 2?

The first two points are typical issues when comparing ecological indicator performances, and the latter, the so-called validation set approach, is pretty common in statistics and machine learning. One solution to overcome these barriers is to calculate the Normalized RMSE.

Normalized Root Mean Squared Error (Norm RMSEP)

Normalizing the RMSE facilitates comparison between datasets or models with different scales. You will, however, find various methods of RMSE normalization in the literature.

You can normalize by the mean of the observed values, by the range (maximum minus minimum), by the standard deviation, or by the interquartile range.

If the response variables have few extreme values, choosing the interquartile range is a good option as it is less sensitive to outliers.

The RMSEP divided by the standard deviation is called the Relative Root Mean Squared Error (RRMSEP).

The reciprocal, 1/RRMSEP, is also used as a metric; a value greater than 2 is considered good.
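A minimal sketch of these normalized variants, assuming actual and predicted are NumPy arrays of test-set values:

import numpy as np
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
nrmse_mean = rmse / np.mean(actual)    # RMSE normalized by the mean of the observations
nrmse_range = rmse / np.ptp(actual)    # normalized by the range (max - min)
rrmsep = rmse / np.std(actual)         # RMSE / standard deviation (RRMSEP)
inv_rrmsep = 1 / rrmsep                # its reciprocal; a value above 2 is considered good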

There are also related terms, such as the Standard Error of Prediction (SEP) and the Ratio of the Standard Error of Prediction to the Standard Deviation (RPD), which are mainly used in chemometrics.

I hope this blog helped you to understand different metrics to evaluate your regression model. I have used multiple sources to understand and write this article. Thank you for your time.

References:

https://www.gepsoft.com/
https://www.investopedia.com/
https://en.wikipedia.org/wiki
https://scikit-learn.org/
https://www.saedsayad.com/
https://www.marinedatascience.co/blog/2019/01/07/normalizing-the-rmse/
