RMSE: Distorting the Evaluation of Results

RMSE evaluation is a weak choice and should not be used for multi-class classification, nor for regression problems whose true values are widely scattered.

Mehran Kazeminia
Towards Data Science

--

RMSE Evaluation (Image by Author)

“RMSE evaluation” is commonly used to teach the basic concepts of classification and regression, and most people start out with a positive view of it. But when this metric is used to score a full multi-class classification challenge, or a regression challenge, the weaknesses of the final results are easy to spot.
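
For readers who want the exact definition: RMSE is the square root of the mean of the squared differences between the predictions and the true values. A minimal NumPy sketch (the function name and the toy numbers are ours, chosen only for illustration, not taken from the challenge code):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Squared errors make large misses dominate the score.
print(rmse([0, 0, 42], [0, 0, 0]))     # ~24.25 -- two exact hits, one big miss
print(rmse([0, 0, 42], [14, 14, 14]))  # ~19.80 -- never correct, but a better score
```

Note how the constant prediction at the mean never matches a single true value yet still scores better; this is the pressure the rest of the article is about.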

In these challenges everyone strives for a better score, and it is natural for participants to try different classification and regression methods. But when a challenge is evaluated purely on RMSE, the results of every method end up trapped in what we call the “RMSE Cage”: the predictions show very little scatter and, incorrectly, cluster around the mean of the correct values.

Note what happens when, for example, a multi-class classification challenge is evaluated solely on RMSE. Long before the final score is computed, the participants themselves tune all of their methods and algorithms (XGBoost, CatBoost, …) to minimize RMSE, because that is what earns the best score. So every method ends up trying to minimize the RMSE equation. But minimizing this equation does not always bring us closer to the correct answer: the results show very little scatter and cluster around a single number (mean(y_train)). In other words, some classes usually get no representation at all in the results, simply because they are not close to that number.
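
A quick synthetic experiment makes the same point. Below we compare a constant prediction equal to the mean against a prediction whose histogram is exactly right but whose values land on the wrong rows. The label distribution here is made up for illustration (skewed like the challenge labels), not taken from the competition data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels with a skewed, scattered distribution (not the real data).
p = np.exp(-0.12 * np.arange(43))
y = rng.choice(np.arange(43), size=100_000, p=p / p.sum())

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

constant_pred = np.full_like(y, fill_value=y.mean(), dtype=float)  # no scatter at all
shuffled_pred = rng.permutation(y)  # perfect histogram, but values on the wrong rows

print(rmse(y, constant_pred))  # roughly the standard deviation of y
print(rmse(y, shuffled_pred))  # roughly sqrt(2) times larger
```

Under RMSE, the shapeless constant prediction wins by a wide margin, which is exactly the pressure that pulls every model into the cage.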

The mathematical proof of this is not the subject of this article. Instead, we review the best results of a Kaggle challenge (including some of the winners) and show that, because of the RMSE evaluation, the scatter of every result submitted to this challenge is very low; some of the correct values do not appear in any of the predictions. We also suggest a few methods for improving your score in this type of challenge by increasing the scatter of your results. This article was written in December 2021 by Somayeh Gholami and Mehran Kazeminia.

Tabular Playground Series — Aug 2021

Tabular Playground Series — Aug 2021 (Image by Author)

Throughout 2021, Kaggle ran a “Tabular Playground Series” challenge every month. These challenges are great for practice, and everyone can learn from other people’s notebooks and the live discussion. Some of them (e.g. August) used RMSE for evaluation. Below, we review the August challenge and the results submitted that month.

In this challenge, 100 features are provided for each sample. The “train” set contains 250,000 samples and the “test” set contains 150,000 samples. The answers (the “loss” column) in the “train” set are integers from 0 to 42, so each of the 150,000 samples in the “test” set must also be assigned a number between 0 and 42. The winner of the challenge is whoever minimizes the RMSE of their predictions.

In these challenges the “train” and “test” sets are always shuffled (randomized) splits of the same data, except for time-series and similar challenges. This means that the numbers in the “loss” column of the “test” set must lie between 0 and 42, and their histogram must look like the histogram of the “loss” column in the “train” set. For example, if roughly 25% of the “loss” values in the “train” set are zero, then about 25% of the “loss” values in the “test” set should also be zero.

Code cannot be run on the Medium platform, but we have published the following four notebooks for this challenge on Kaggle. They contain the code for all the results presented in the rest of this article, and you can run them yourself:

[1] TPS Aug 21 — XGBoost & CatBoost
[2] TPS8 — Smart Ensembling
[3] TPS Aug 21 — Results & RMSE Evaluation
[4] TPS Aug 21 — Guide Point [Snap to Curve]

By: Somayyeh Gholami & Mehran Kazeminia

Profile of challenge data

The “loss” column of the “train” set contains 250,000 integers between 0 and 42. According to the chart below, a fraction 0.240576 of them are 0, 0.088276 are 1, 0.088900 are 2, …, and 0.000012 are 42. As mentioned earlier, 150,000 numbers must be predicted for the “loss” column of the “test” set, and of those predictions about 0.24 should be zero, about 0.09 should be one, and so on.

data_train[‘loss’] (Image by Author)
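
Those fractions can be reproduced directly from the training file with pandas. A minimal sketch, assuming the challenge CSV has been downloaded from Kaggle and the target column is named loss as in the competition data:

```python
import pandas as pd

# Assumes the competition training file has been downloaded locally.
data_train = pd.read_csv("train.csv")

# Fraction of the 250,000 training rows taken by each loss value (0..42).
loss_share = data_train["loss"].value_counts(normalize=True).sort_index()
print(loss_share.head())  # e.g. loss 0 -> ~0.2406, loss 1 -> ~0.0883, ...

# A correct prediction for the 150,000 test rows should reproduce these shares.
```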

Surprisingly, none of the predictions even come close to the correct values. As you will see, the numbers predicted by every method lie between 3 and 12. For example, no prediction is anywhere near zero, even though we know that about 24% of the predicted numbers should be zero (or close to it).

In other words, the histogram of the “loss” column in the “train” set is the diagram below, so the histogram of a correct prediction should look like this diagram.

(Image by Author)

But if you look at the various predictions made for this challenge, the histogram of all of them looks like the diagram below. In other words, all of the classification and regression methods merely minimize the RMSE equation and, unfortunately, never come close to the correct prediction.

(Image by Author)

In our first and second notebooks we used XGBoost and CatBoost, and then ensembled our results with those of several other public notebooks to improve our score. But the histogram of every notebook’s results looked like the diagram above, meaning that all of the results were captured in the “RMSE Cage”. In the third and fourth notebooks we looked at this same problem and offered some approximate ways to improve the score.
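
For context, the ensembling we refer to is just a weighted blend of submission files. A rough sketch with hypothetical file names and weights (not the ones from our notebooks):

```python
import pandas as pd

# Hypothetical submission files from different models, each with columns: id, loss
files_and_weights = [
    ("submission_xgboost.csv", 0.4),
    ("submission_catboost.csv", 0.4),
    ("submission_public_lgbm.csv", 0.2),
]

blend = None
for path, weight in files_and_weights:
    sub = pd.read_csv(path).sort_values("id").reset_index(drop=True)
    blend = sub["loss"] * weight if blend is None else blend + sub["loss"] * weight

ensemble = pd.read_csv(files_and_weights[0][0]).sort_values("id").reset_index(drop=True)
ensemble["loss"] = blend
ensemble.to_csv("submission_ensemble.csv", index=False)
# Note: averaging predictions shrinks their scatter even further -- it cannot escape the cage.
```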

We have selected seven excellent notebooks from this challenge, each with a very good score. They use the following methods:

1. Voting (CB/XGB/LGBM) >>> Score: 7.86780
2. Tabular denoising + residual network >>> Score: 7.86595
3. LightAutoML >>> Score: 7.86259
4. LightGBM >>> Score: 7.86132
5. LGBM >>> Score: 7.85852
6. LightAutoML Classifier-Regressor Mix >>> Score: 7.85308
7. LGBM/XGB/CatBoost >>> Score: 7.85239

The diagram below shows the histograms of the results of these seven notebooks. As you can see, they all fit in a single narrow box, which we call the “RMSE Cage”.

(Image by Author)

All of these histograms are centered around approximately np.mean(y) = 6.81392; every prediction clusters around this number. If we draw the histogram of y and the prediction histograms of these seven notebooks in one diagram, the difference between the correct values and the predictions becomes quite clear.

(Image by Author)
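
If you want to reproduce that combined diagram, a minimal matplotlib sketch follows; y_train and preds are assumed to hold the training loss values and one notebook’s test predictions, respectively:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_cage(y_train, preds):
    """Overlay the true loss distribution and a notebook's predictions."""
    bins = np.arange(0, 44) - 0.5  # one bin per integer class 0..42
    plt.hist(y_train, bins=bins, density=True, alpha=0.5, label="train loss (correct shape)")
    plt.hist(preds, bins=bins, density=True, alpha=0.5, label="predictions (RMSE cage)")
    plt.axvline(np.mean(y_train), linestyle="--", label="mean of train loss ~ 6.81")
    plt.xlabel("loss value")
    plt.ylabel("relative frequency")
    plt.legend()
    plt.show()
```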

So what to do?

RMSE evaluation is weak and should not be used for multi-class classification, nor for regression problems with widely scattered true values, because the results of every method will end up inside the “RMSE Cage”. As mentioned, this is mathematically provable. Still, if you take part in a challenge that uses RMSE evaluation, there are a few things to keep in mind:

  • “Ensembling” methods may improve your notebook’s score, but they will not increase the scatter of your results and therefore do not actually solve this problem (the RMSE Cage).
  • The ordering and ranking of the numbers predicted by the notebooks is not exact, so even a small change can push the RMSE equation away from its optimum. In practice, we cannot simply increase the scatter of the numbers.
  • In our third notebook we introduced an approximate method called “Coordinate with constant values”. The idea is to add one fixed value to some of the results and subtract another fixed value from some of the others (see the sketch after this list). Doing this successfully takes a lot of trial and error.
(Image by Author)
  • In our fourth notebook we introduced another method called “Snap to Curve”. The idea is to snap the predictions to the curve based on a “Guide Point”, and the best guide point is determined with the RMSE evaluation equation itself. The value np.mean(yy) would be the ideal guide point, but we do not know that number, so we start from np.mean(y) and then adjust it with a number R.
(Image by Author)
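
As promised after the third bullet, here is a rough sketch of the “Coordinate with constant values” idea. The threshold and the two shift values below are placeholders that have to be found by repeated trial and error; they are not the constants we actually used:

```python
import numpy as np

def coordinate_with_constants(preds, threshold, low_shift, high_shift):
    """Stretch the predictions away from their center by a fixed amount on each side."""
    preds = np.asarray(preds, dtype=float)
    adjusted = preds.copy()
    adjusted[preds < threshold] -= low_shift    # push the low side further down
    adjusted[preds >= threshold] += high_shift  # push the high side further up
    return np.clip(adjusted, 0, 42)             # predictions must stay in [0, 42]

# Example with placeholder constants (to be tuned by trial and error):
# new_preds = coordinate_with_constants(preds, threshold=6.8, low_shift=0.5, high_shift=0.5)
```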
