How to teach your regression the distinction between relevant outliers and irrelevant noise

You are working as a Data Scientist for FC Barcelona and have taken on the task of building a model that predicts the value increase of young talent over the next 2, 5, and 10 years. You might want to regress the value on some meaningful metrics such as assists or goals scored. The standard procedure would now be to drop the most severe outliers from the dataset. While the resulting model might predict decently on average, it will unfortunately never understand what makes a Messi (because you dropped Messi along with all the other "outliers").
The idea of dropping or replacing outliers in regression problems stems from the fact that simple linear regression is comparatively sensitive to extremes in the data. However, this approach would not have helped you much in your role as Barcelona’s Data Scientist. The simple message: outliers are not always bad!
We usually want to learn every last bit of the true data generating process (DGP). Consequently, we need a regression approach that is robust enough to distinguish between meaningful extremes and non-meaningful outliers. The following case illustrates these issues: we compare several robust regression techniques by their performance on a straightforward regression problem. All approaches are implemented from scratch in my repo for this article.
A Very Brief Recap On Linear Regression
There are loads of excellent articles describing the maths and intuition behind linear regression in detail. If you need a detailed refresher, check [here](https://towardsdatascience.com/everything-you-need-to-know-about-linear-regression-b791e8f4bd7a) or here. For our sake, let’s just remind ourselves that we are aiming to predict a given target y for a data point x.
The general and matrix form of our regression is the following:
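Here, $y_i$ is the target of data point $i$, $x_{i1}, \dots, x_{ip}$ are its features, and $\varepsilon_i$ is the error term; the symbols follow the usual textbook convention:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i
\qquad \Longleftrightarrow \qquad
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}
$$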

Our goal in training the regression model is to learn the parameter vector beta shown above. To get there, we minimize a pre-defined cost function that measures the residuals produced by any candidate beta vector.
Data Generation
To benchmark existing regression techniques on their robustness, we first need a dataset that illustrates the concept of extremes and outliers. We chose a set of Student-t distributed points with 4 degrees of freedom, transformed with the sine function as our ground-truth DGP. We then add some strong outliers in the y-direction.
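A minimal sketch of such a dataset (my own reconstruction of the setup; the sample size, noise scale, and outlier magnitudes are arbitrary choices rather than the exact values from the repo):

```python
import numpy as np

# Reproducible sketch of the benchmark data: heavy-tailed x-values,
# a sine ground-truth DGP, and a handful of strong outliers in y.
rng = np.random.default_rng(seed=0)

n = 500
x = rng.standard_t(df=4, size=n)      # Student-t distributed points, 4 degrees of freedom
y = np.sin(x)                         # sine transform as the ground-truth DGP

# add a few strong outliers in the y-direction
outlier_idx = rng.choice(n, size=10, replace=False)
y[outlier_idx] += rng.normal(loc=0.0, scale=8.0, size=outlier_idx.size)

X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column
```

The fits sketched in the following sections reuse this `X` and `y`.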

(Ordinary) Least Squares (LS)
You are applying the out-of-the-box linear regression function in Excel, Python, or any other tool? Then OLS is most likely what you are using. Indeed, Least Squares makes a great choice in many settings: its cost function is convex and differentiable everywhere, including at 0, which makes it incredibly easy to optimize. The plot below visualizes that minimizing the simple parabolic Least Squares loss function is high-school math. Besides, the Ordinary Least Squares shortcut (the normal equations) gives a one-step, closed-form estimate of the solution.

Sounds flawless? Sadly, simplicity comes at a price. In the cost function, non-negativity of the residuals is ensured by squaring them. This yields a smooth function, but it also inflates the weight placed on large residuals. Effectively, we teach our regression line to treat outliers as quadratically more important than points with a close fit. That makes Least Squares particularly prone to outliers. We can witness the devastating result when running a simple Least Squares regression on our dataset. So what can we do to improve the situation?
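To make the one-step shortcut concrete, here is a minimal fit, reusing `X` and `y` from the data sketch above (a generic implementation, not the exact code from my repo):

```python
import numpy as np

def fit_ols(X, y):
    """Return the beta minimizing the sum of squared residuals ||y - X @ beta||^2."""
    # np.linalg.lstsq solves the least-squares problem in one shot,
    # equivalent to the closed-form normal equations but numerically stabler.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

beta_ols = fit_ols(X, y)   # this line gets pulled heavily towards the outliers
y_hat = X @ beta_ols
```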

_You can find the from-scratch implementation of the Least Squares Regression in my Repo for this article or common libraries such as scikit-learn or statsmodels._
Upgrade #1: Least Absolute Deviation
Given our discussion of Least Squares, it is straightforward to simply discard the squared residual penalty. The Least Absolute Deviation ensures non-negativity by evaluating the absolute instead of the squared residuals. The impact is visible in the shape of the cost function below.

Compared with Least Squares above, the less favorable differentiability properties (the kink at 0) are visible at a glance. In many situations, however, this trade-off pays off. In our case, it depends on the performance metric of choice: in terms of Mean Squared Error (MSE), Least Squares outperformed Least Absolute Deviation by a factor of ~3.3x, while Least Absolute Deviation outperformed Least Squares by a factor of ~5.1x in terms of Mean Absolute Error (MAE). We now have to prioritize. Visually inspecting the regression lines, our choice would fall on the Least Absolute Deviation regression. But visual plot inspection should not be the deciding factor for choosing a model… So let’s go deeper!
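A bare-bones version of the idea, again reusing `X` and `y` (sketch only; the optimizer choice here is mine, and the repo version may differ):

```python
import numpy as np
from scipy.optimize import minimize

def fit_lad(X, y):
    """Minimize the sum of absolute residuals (Least Absolute Deviation)."""
    def lad_loss(beta):
        return np.abs(y - X @ beta).sum()
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)        # warm start from OLS
    # Nelder-Mead is derivative-free, so the kink of |r| at 0 is not a problem
    return minimize(lad_loss, beta0, method="Nelder-Mead").x

beta_lad = fit_lad(X, y)
```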

You can find the from-scratch implementation of the Least Deviation Regression in my Repo for this article.
Upgrade #2: Huber Loss
Squared Loss and Absolute Deviation both have their strengths and weaknesses. Now, what if we had a technique that combines both in one? The Swiss statistician Peter Huber had the same idea.
Huber Loss combines a quadratic penalty for small errors with a linear one for large errors. Which of the two penalties a residual receives depends on the new hyperparameter delta: it is the threshold that decides which penalty to apply to a given data point. This mechanism results in a hybrid loss function that grows only linearly for large residuals, is still differentiable at 0, and depends on the hyperparameter delta. To visualize this, check the plot below.

Delta is an absolute measure. Consequently, there is no general default value that guarantees good performance; the choice depends entirely on the underlying data and on which deviations we are willing to accept.
In the case of our data, the regression line that performs best on both MSE and MAE uses delta = 1, resulting in an MSE of 10.0 and an MAE of 0.75. But we can do better!
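A compact sketch of the loss and the corresponding fit (generic code, not the repo version; delta = 1 matches the value discussed above):

```python
import numpy as np
from scipy.optimize import minimize

def huber_loss(residuals, delta):
    """Quadratic penalty for |r| <= delta, linear penalty beyond."""
    abs_r = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear).sum()

def fit_huber(X, y, delta=1.0):
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)        # warm start from OLS
    # the Huber loss is differentiable everywhere, so the default BFGS works fine
    return minimize(lambda beta: huber_loss(y - X @ beta, delta), beta0).x

beta_huber = fit_huber(X, y, delta=1.0)
```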

_Look at the from-scratch implementation of this loss function in my Repo for this article or common libraries such as scikit-learn or statsmodels._
Upgrade #3: Quantile Regression
We have seen how using absolute values can outperform Least Squares. We have also seen how a flexible formulation that decides on a case-by-case basis which penalty to apply can improve performance even further. Finally, let’s look at Quantile Regression.
If you will, Least Absolute Deviation is a special case of Quantile Regression. The observant reader might have noticed that the Least Absolute Deviation technique effectively estimates the conditional median of our DGP. Since the median is nothing but the middle quantile, why not design this flexibly? Quantile Regression estimates the quantile specified by the new hyperparameter tau. To build intuition: the Quantile Regression line with tau = 0.5 should be very similar to the regression line we obtained with simple Least Absolute Deviation. Again, a well-performing tau depends on the dataset we are working with.
Applied to our example, Quantile Regression wins the prize: at tau = 0.1 we attain an MSE of 0.66 and an MAE of 0.2.
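For intuition, here is a sketch of the pinball (check) loss behind Quantile Regression, fitted with a generic derivative-free optimizer rather than the linear program mentioned in the conclusion; statsmodels’ QuantReg offers a production-ready alternative:

```python
import numpy as np
from scipy.optimize import minimize

def pinball_loss(residuals, tau):
    """Check loss: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    return np.where(residuals >= 0, tau * residuals, (tau - 1) * residuals).sum()

def fit_quantile(X, y, tau=0.5):
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)        # warm start from OLS
    # derivative-free optimizer, since the pinball loss has a kink at 0
    return minimize(lambda beta: pinball_loss(y - X @ beta, tau), beta0,
                    method="Nelder-Mead").x

beta_q = fit_quantile(X, y, tau=0.1)   # tau = 0.5 would recover the LAD fit
```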

_Look at the from-scratch implementation of the Quantile Regression in my Repo for this article or at common libraries such as Statsmodels._
Conclusion & Outlook
Dropping outliers that exceed a certain confidence range can easily go south when we are modeling real-world data. In our search for more reliable options, we discussed several techniques that improve significantly on out-of-the-box linear regression. Specifically, we focused on teaching our model to distinguish between outliers that we do not want it to learn and extreme values that carry valuable information about the underlying process we are striving to understand.
In our case, Quantile Regression took the prize. However, just like simplicity, performance comes at a cost: Quantile Regression does not have a smooth, differentiable loss function but requires us to solve a linear program instead. In practice, the model choice is always one of many variables. This collection provides you with a range of tools to sharpen your awareness of which outliers you can drop and which might carry valuable information.
I would love to chat with you about these and other topics, so don’t hesitate to reach out.
Some Further Readings & References
ROUSSEEUW (1984)
ROUSSEEUW & VAN DRIESSEN (2005)
KOENKER (2008): Quantile Regression
BURRUS: Iterative Reweighted Least Squares
https://stats.idre.ucla.edu/r/dae/robust-regression/
https://scikit-learn.org/stable/modules/linear_model.html#robustness-regression-outliers-and-modeling-errors
https://docs.pymc.io/notebooks/GLM-robust-with-outlier-detection.html#Load-Data
https://jermwatt.github.io/machine_learning_refined/notes/5_Linear_regression/5_3_Absolute.html